[1] "Hello World!"
Module 3: R
University of South Florida
You can print a value to the console with the print() function.
print() is called implicitly when you evaluate an expression at the console without it.
You can use R to do basic math calculations.
Question: What is 500 * (100 / 2.5) + 770?
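The arithmetic, comparison, and logical operators used in this module can be tried directly at the console. A minimal sketch with arbitrary numbers (the variable names are only for illustration):

```r
# Arithmetic operators
7 + 2                     # addition
remainder <- 7 %% 2       # remainder of a division
quotient <- 7 %/% 2       # integer quotient
same_power <- 2^3 == 2**3 # ^ and ** both mean exponentiation

# Comparison and logical operators return TRUE or FALSE
is_between <- (3 > 1) & (3 <= 5)
print(c(remainder, quotient))
print(is_between)
```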
Addition (+), Subtraction (-)
Multiplication (*), Division (/)
Exponentiation (^ or **)
Modulo (%%): Returns the remainder of a division
Integer division (%/%): Returns the integer quotient
Less than / greater than (< or >)
Less than or equal / greater than or equal (<= or >=)
Equality (==) and Inequality (!=)
NOT (!)
AND (&), OR (|)
Exercise: Check
3 >= 5
TRUE == FALSE
TRUE & FALSE
(3 > 1) & (3 <= 5)
TRUE | FALSE
!TRUE == FALSE
!(3 > 1) & (3 <= 5)
A scalar in R is the simplest data type and represents a single element, not a collection.
To combine multiple elements into a vector, use the c() function.
[1] 1 2
[1] 2 3 4 6 7
[1] 3
Tip
In R, “scalar” is actually represented as a vector of length 1.
Colon : generates integer sequence from:to.
To check the number of elements of a vector, use length() function:
Vectorized operations are vector-in, vector-out, as opposed to scalar operations (one-in, one-out).
In R, most operations are designed with vectors in mind.
For example, + and * are vectorized operators.
Question:
What do you get when c(2,3) * c(3,9)?
What’s the length of the output?
When the operands are of different lengths, the shorter one is recycled as many times as necessary.
When it cannot be recycled entirely, it still works but raises a warning message:
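Recycling can be seen in a short sketch. The vectors below are arbitrary illustration values, not the ones from the slides:

```r
short <- c(1, 2)
long <- c(10, 20, 30, 40)

# Length 4 is a multiple of length 2: short is reused as 1, 2, 1, 2
recycled <- short + long
print(recycled)

# Length 3 is not a multiple of length 2: R still returns a result,
# but warns that the longer length is not a multiple of the shorter one
partial <- c(1, 2) + c(10, 20, 30)
print(partial)
```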
Note
Vectorized operation is much, much faster than iterating (looping) over each single (scalar) element.
Avoid using loops and utilize vectorized operation whenever possible.
For computers, using vectorization or not is like a difference in our mental calculation between
What is wrong with c(1 + 1, 2 + 1, 3 + 1)? How can you make it better?
Answer how c(1,2) * 1:3 works.
REPL: Interactive programming
So far, we’ve done interactive programming, REPL:
On console: > means: “Waiting for your command”
+ means: “Continue command”
CTRL + C to abort.
Script (Batch)
Instead, you can run the whole script outside of R environment, using Rscript
Running a script file on current R session:
Writing a complete script is your final goal in programming.
Tip
REPL for development, and script for production.
To run the (whole) script, Ctrl (Cmd) + Shift + Return
Use <- to bind a name (symbol) to an object.
Here, my_number is called symbol, or name of an object.
Tip
Style guide: though you still can use =, use <- for assignment.
Use = for specifying function arguments instead.
Some IDEs (e.g. RStudio / Positron / VS Code) have Alt (Option) + - as a shortcut.
R has strict rules about a syntactic name (symbol).
You can’t use reserved words like TRUE, NULL, if, etc.
If you deliberately want to use a non-syntactic name, wrap it in backticks (`).
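A minimal sketch of backtick-quoted names. Both names are made up for illustration, and such names are usually best avoided:

```r
# A name containing a space is non-syntactic, so it needs backticks
`my variable` <- 10
shifted <- `my variable` + 5

# A name starting with a digit is also non-syntactic
`2nd_value` <- 3
print(shifted)
```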
Since a symbol cannot contain whitespace, there are some naming conventions.
snake_case
camelCase
PascalCase
Tip
It is better to make a short, self-explanatory name.
e.g. weight <- 15 is easier to understand than my_variable_quantity <- 15
readline() generates a prompt for interactive input.
The response is returned as a value and can be assigned to an object.
bond_prices.
interest_rate
Consider the following code. What is the printed value of a?
The expression a * 2 is evaluated but not assigned to any variable, so a remains unchanged.
To store the result, you need to assign it:
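The difference between evaluating and assigning can be sketched as follows. The starting value of a is an assumption, since the original snippet is not shown here:

```r
a <- 5       # assumed starting value
a * 2        # evaluated and printed, but the result is discarded
print(a)     # a is still 5

a <- a * 2   # rebinding the name stores the result
print(a)     # a is now 10
```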
Vector type: Common data types
Special type (non-vector): non-vectors
Vectors are the most important family of data types in R.
Vector is a data structure that stores multiple elements. It comes in two flavors:
NULL is not a vector, but often serves as zero length vector.
There are four primary types of atomic vector in R, and two others.
logical: TRUE, FALSE, NA
integer: attach L to treat the number as a strict integer.
Caution
numeric is a collective term for both doubles and integers, but in practice it is often used as a synonym for “double” or “real number”.
character: enclosed in " or '
[1] "character"
[1] "character"
Tip
Style guide: Use double quote " for character instead of ' if possible.
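The four primary atomic types can be checked with typeof(). A small sketch with arbitrary values:

```r
types <- c(
  typeof(TRUE),   # logical
  typeof(1L),     # integer (note the L suffix)
  typeof(1),      # double
  typeof("1")     # character
)
print(types)

# Both integer and double count as "numeric"
both_numeric <- is.numeric(1L) && is.numeric(1)
```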
raw type: binary data type
complex type: complex numbers (e.g. 3 + 4i)
rarely needed in Finance
NA is considered a logical vector of length 1.
NULL is an intentional empty “placeholder” in R. It is a special type (NULL) of length 0.
NaN is a numeric missing value of length 1.
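These three special values can be compared side by side. A short sketch:

```r
na_info <- c(typeof(NA), length(NA))       # logical, length 1
null_info <- c(typeof(NULL), length(NULL)) # NULL, length 0
nan_info <- c(typeof(NaN), length(NaN))    # double, length 1

# NULL vanishes when combined, while NA occupies a slot
combined_length <- length(c(1, NULL, NA))
print(combined_length)
```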
What are four primary types of atomic vector?
What are the types of a,b,c,d below?
Confirm your answer with typeof().
List is a generic vector that is not atomic.
Atomic vector can have only one type for its elements (Double, Integer, Logical, …)
List can hold multiple data types for its member (even list itself)
List can have atomic vectors as its elements, with varying lengths:
[1] "list"
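A list along these lines produces the output above. The contents of my_list here are hypothetical, since the original definition is not shown:

```r
# Elements of different types and lengths, including a nested list
my_list <- list(
  tickers = c("AAPL", "MSFT", "GOOGL"),
  price = 150.5,
  is_listed = TRUE,
  nested = list(1L, "a")
)
print(typeof(my_list))

# Type of each element
element_types <- sapply(my_list, typeof)
print(element_types)
```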
Question: What is the length of my_list?
Attributes are metadata that is attached to R objects, providing additional information or functionality.
Common attributes:
Names: Labels for elements in a vector or list
Dimensions (dim): Used for matrices, arrays
Class: Defines how an object should be treated by functions
etc.
Elements of vector (atomic, generic) can be named.
There are roughly three ways to assign names attribute.
In R, each frequently used attribute has its own accessor function named after it, such as names(), class(), dim().
Or use attr() function to set attribute:
Or assign names by construct:
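The three approaches can be sketched on a small vector. The vectors x, y, z and their labels are placeholders:

```r
# 1. Accessor function
x <- c(1, 2, 3)
names(x) <- c("a", "b", "c")

# 2. attr()
y <- c(1, 2, 3)
attr(y, "names") <- c("a", "b", "c")

# 3. At construction
z <- c(a = 1, b = 2, c = 3)

# All three produce the same named vector
all_same <- identical(x, y) && identical(y, z)
print(all_same)
```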
Give names attribute to stock_prices vector using aforementioned three methods.
names(obj) <-
attr(obj, "names") <-
Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
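A matrix like the one above can be obtained by setting dim on a plain vector. A minimal sketch, where the name m is arbitrary:

```r
m <- 1:6
dim(m) <- c(2, 3)   # 2 rows, 3 columns, filled column by column
print(m)
print(class(m))
```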
The class attribute in R defines how an object behaves with functions.
Especially important classes in Finance are:
Very important class in Finance.
They are built from double type atomic vector (type), but have their own specific rules for use (class).
Check their data type:
Check their attributes: they have class attributes.
To directly access the class attribute:
For an example, see how it works with + function.
[1] "2026-02-09"
[1] "2026-02-08 15:43:36 EST"
Q: Why does +1 yield different results?
A: Because they are in different classes, +1 is interpreted differently: one day for a date, one second for a date-time.
The class attribute gives context for how an object should behave with functions.
Deep down, they are just numbers:
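A small sketch of the same idea with fixed values. The timestamp and the UTC time zone are assumptions chosen so the example is reproducible:

```r
d <- as.Date("2026-02-08")
tm <- as.POSIXct("2026-02-08 15:43:36", tz = "UTC")

print(d + 1)    # one day later
print(tm + 1)   # one second later

# Stripping the class reveals the underlying numbers
days_since_epoch <- unclass(d)         # days since 1970-01-01
seconds_since_epoch <- as.numeric(tm)  # seconds since 1970-01-01 00:00:00 UTC
print(days_since_epoch)
```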
Time has “tzone” attribute (time-zone) that controls “formatting” of date-time.
[1] "2026-02-08 15:43:36 EST"
[1] 1770583416
[1] "2026-02-08 20:43:36 UTC"
[1] 1770583416
attr(,"tzone")
[1] "UTC"
Factors (or Categorical) can only have a set of predefined values.
[1] "integer"
[1] "factor"
If they were stripped of all attributes:
A class built on top of list type, with 2D tabular representations
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
[1] "list"
[1] "data.frame"
If class attribute was removed: it turns back to list
Let’s browse the attributes of iris dataframe:
Tibble is a robust dataframe class.
as_tibble()
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Those “building-block” types in R have implicit class:
Their class is not shown in the attributes(), but still shown when explicitly asked with class()
typeof(c(1,2,3)) and typeof(c(1L, 2L, 3L)). What’s the difference?
my_first_object
MySecondObject
my_first_object with MySecondObject. What do you get?
my_name
my_name + 3? (Expect error)
Since class determines the behavior of an object, it is crucial to know the class of your data, especially when performing function calls.
Note
Calling a function means executing/applying a function.
For example, you cannot add a character and a numeric.
Other example: c()
Some functions coerce the class / type instead.
[1] "character"
[1] "character"
Tip
c() does coercion when inputs are in different types, but not all functions do coercion.
Let’s see how coercion works with c().
What is the type of test?
What is the class of test?
In general, coercion is performed in fixed order:
Character (heavy) << Double << Integer << Logical (light)
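The coercion order can be checked pairwise with c(). A short sketch with arbitrary values:

```r
coerced_types <- c(
  typeof(c(TRUE, 1L)),   # logical + integer   -> integer
  typeof(c(1L, 2.5)),    # integer + double    -> double
  typeof(c(2.5, "a"))    # double + character  -> character
)
print(coerced_types)

# Mixing all four: everything becomes character
mixed <- c(TRUE, 2L, 3.5, "x")
print(mixed)
```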
What would be the type of:
Given the same length, logical and integer vectors take the least amount of memory, then double, then character.
Single square brackets [ and ]
Double brackets [[ and ]]
Square brackets [ select one or more elements of a vector
Elements can be accessed with index:
Use double brackets [[ on vector when you want to select single element (Scalar).
Tip
Style guide: while single bracket on vector still works, use [[ on vectors to reinforce your expectation.
Subsetting vector can be combined with assignment <- to modify selected values.
AAPL MSFT GOOGL AMZN
150 205 250 303
AAPL MSFT GOOGL AMZN
235 205 250 303
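The output above is consistent with code along the following lines. The exact original code is not shown, so the first two statements are an inference from the printed values and the later ones use made-up numbers:

```r
stock_prices <- c(AAPL = 150, MSFT = 205, GOOGL = 250, AMZN = 303)

# Modify a single element by name
stock_prices[["AAPL"]] <- 235

# Select several elements by position
print(stock_prices[c(2, 4)])

# Assign several elements at once (illustrative values)
stock_prices[c("GOOGL", "AMZN")] <- c(255, 310)

# Logical subsetting: keep elements where the condition is TRUE
above_240 <- stock_prices[stock_prices > 240]
print(above_240)
```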
Assigning multiple elements:
Access 1st and 3rd element of stock_prices
Assign “MSFT” and “AAPL” to 300.
You can subset vector with logical vector inside brackets:
AAPL MSFT GOOGL AMZN
FALSE FALSE TRUE FALSE
Generate below stock prices:
Subset and get the following:
There are 3 ways to index lists, each with its own merits:
[ ] : returns original (list) type
[[ ]] : returns element’s type
$
A list is like a train carrying multiple cars:
Single bracket returns a list object, train.
Double bracket returns element’s type, car.
$ Operator
$ is a shorthand operator for double bracket [[ with a variable name.
Construct a list of portfolio:
Subset by number index:
Subset by name:
You can use chained bracket operation and assignment:
You can remove a component of list by assigning NULL
Assigning NULL to an element of a vector doesn’t work:
Error in prices[[1]] <- NULL: replacement has length zero
Should use negative indexing & overwrite in this case:
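The list operations described above can be sketched together. The portfolio contents and the prices vector are hypothetical:

```r
portfolio <- list(
  tickers = c("AAPL", "MSFT"),
  shares = c(10, 5),
  active = TRUE
)

train <- portfolio[1]          # single bracket: still a list
car <- portfolio[[1]]          # double bracket: the character vector itself
same_car <- portfolio$tickers  # $ is shorthand for [["tickers"]]

# Chained brackets with assignment
portfolio[["shares"]][[2]] <- 8

# Remove a list component by assigning NULL
portfolio$active <- NULL

# For an atomic vector, drop an element with negative indexing instead
prices <- c(150, 205, 250)
prices <- prices[-1]
print(names(portfolio))
```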
Create a list object portfolio as:
Subset the portfolio to:
Everything that exists is an object.
Everything that happens is a function call.— John Chambers
Function calls in R come in four varieties: prefix, infix, replacement, and special.
Special forms such as [[, for, and if don’t have a consistent structure.
An interesting property of R is that every infix, replacement, and special form can be rewritten in prefix form.
Use backticks ` to wrap the function symbol.
Our first prefix function call, c(), concatenates all the values and generates a single object.
c() has an arbitrary number of arguments (...)
The seq() function generates a sequence of numbers.
from, to, and by.
Some functions have pre-defined (default) argument values:
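Rewriting calls in prefix form, and calling seq() with and without by, might look like this. The values are arbitrary:

```r
# Infix call and its prefix equivalent
infix_sum <- 1 + 2
prefix_sum <- `+`(1, 2)

# Special form [[ rewritten as a prefix call
x <- c(10, 20, 30)
second <- `[[`(x, 2)

# seq() with explicit arguments
by_three <- seq(from = 1, to = 10, by = 3)
print(by_three)

# by is omitted here, so a step of 1 is used
by_default <- seq(from = 1, to = 5)
print(by_default)
```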
You can define custom function (User-defined function) in R with the following syntax:
The function can be called in prefix form:
Default argument value can be assigned by construction:
Q: What would happen if user calls above function with
my_first_function()
By default, R evaluates each line as an individual statement.
Using curly braces { } allows you to group multiple expressions into a single unit that executes together.
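A minimal sketch of a user-defined function with a default argument and a braced body. The function add_margin() and its arguments are hypothetical, not the function from the slides:

```r
add_margin <- function(price, margin = 0.1) {
  # Both lines in the braces run together as one unit
  marked_up <- price * (1 + margin)
  round(marked_up, 2)   # the last evaluated expression is returned
}

default_call <- add_margin(100)                 # uses margin = 0.1
custom_call <- add_margin(100, margin = 0.25)   # overrides the default
print(c(default_call, custom_call))
```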
Prep: Load stringr package with library(stringr)
Write your own function say_hello() that takes no argument. It prints “Hi!” when called.
Now tweak the function to accept an argument, name. It prints “Hello, {name}!”
str_glue("Hello, {name}!")
return() in a function can serve as an early exit: the remaining code won’t be executed.
A function has three parts:
The formals(): list of arguments
The body(): code inside the function
The environment(): where you defined the function
Check formals(), body() and environment():
Some functions are found from external packages:
All R functions are built by someone, and documentation is typically provided.
For detailed description of any function, use ? followed by the function’s name.
For example, try below code in your console:
Or, use help()
The present value of a perpetuity, where the cash flow grows at a constant rate g, is given by:
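\[ PV_{PER} = \frac{PMT}{r - g} \]
where PMT is the cash flow (payment), r is the discount rate, and g is the growth rate.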
This formula applies when r > g.
You can design your perpetuity function in R with following syntax:
Let’s call the function above:
Define a perpetuity calculator function, pv_per(). What is the pv when PMT = $50,000, r = 4%, g = 0%?
What is the pv when PMT = $50,000, r = 4%, but g are 1%, 2%, 3%?
What happens if user doesn’t specify one argument?
You can set default values for arguments, allowing them to be omitted when calling the function.
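One possible shape for such a function is sketched below, assuming the formula PMT / (r - g) and a default of g = 0. The input check and the numbers are illustrative choices, not taken from the slides:

```r
pv_per <- function(PMT, r, g = 0) {
  # The formula only applies when r > g
  if (any(r <= g)) stop("The formula requires r > g.")
  PMT / (r - g)
}

no_growth <- pv_per(PMT = 100, r = 0.05)   # g falls back to its default, 0
print(no_growth)

# Because the arithmetic is vectorized, several growth rates can be passed at once
several <- pv_per(PMT = 100, r = 0.05, g = c(0, 0.01))
print(several)
```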
Functions can do more complex calculations. Following the Black-Scholes put / call pricing formula, we can generate function as below:
bsm_price <- function(S0, K, r, T, sigma, type = "call") {
  # d1 and d2 are the standard Black-Scholes intermediate terms
  d1 <- (log(S0 / K) + (r + 0.5 * sigma^2) * T) / (sigma * sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  if (type == "call") {
    return(S0 * pnorm(d1) - K * exp(-r * T) * pnorm(d2))
  } else if (type == "put") {
    return(K * exp(-r * T) * pnorm(-d2) - S0 * pnorm(-d1))
  } else {
    stop("Invalid option type. Use 'call' or 'put'.")
  }
}
Calculate price estimates with four scenarios:
Functions are typically named so they can be reused multiple times.
However, you can skip naming a custom function; such a function is called an anonymous function.
They are not stored as objects since they do not have assigned symbols (names).
Note
Syntactic sugar refers to a feature in programming that makes the code simple to read or write, without adding functionality.
(Anonymous) functions can be defined with syntactic sugar (concise expression):
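The named form, the anonymous form, and the shorthand can be compared in a short sketch. The \(x) shorthand needs R 4.1 or later, and the squaring function is just an illustration:

```r
square <- function(x) x^2               # named function

from_named <- square(4)
from_anonymous <- (function(x) x^2)(4)  # anonymous, called immediately
from_shorthand <- (\(x) x^2)(4)         # same, using the \(x) shorthand

# Anonymous functions are handy as arguments to other functions
scaled <- sapply(1:3, \(x) x * 10)
print(scaled)
```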
Convert below perpetuity function (pv_per) to anonymous function:
A Motivating example
Solve below math problem. Describe your steps. What was the first and the last step?
\[ \sqrt{(2+4)^2 - 3 * 4} = ? \]
\[ \sqrt{(2+4)^2 - 3 * 4} \]
and then square it, and save it in your memory
and then subtract it from previous, and update your memory
and then square root the value
Similarly, code is often written in an order that differs from the order in which it is calculated.
It is easier for us to read & write code in the order it is operated.
When we have composite function calls such as
The call sequence is x -> k() -> h() -> g() -> f().
It is rather easier to read, write and debug if we can write a code like:
This is where pipe operator |> becomes handy in R.
The pipe operator does “and then” job, and it can be written as:
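A nested call and its piped equivalent, sketched on arbitrary numbers. The native pipe |> needs R 4.1 or later:

```r
x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped: read left to right, "take x, and then sqrt, and then sum, and then round"
piped <- x |> sqrt() |> sum() |> round(1)

print(piped)
```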
Tip
Style guide: use |> instead of %>%. Use shortcut Cmd (Ctrl) + Shift + M.
Sometimes you’ll see the %>% operator instead. It comes from an external package (magrittr), whereas |> is native to R (since version 4.1). To use %>%, load the package first with library(magrittr).
Solve \(\sqrt{2^3}\) using pipe operator.
sqrt() function
cube that does x^3
2
Packages are add-on libraries that extend the functionality of R.
install.packages()
library()
Installing packages:
Load packages: you need to load packages to use its functionality.
Control structure dictates which code gets executed and when.
if statements: Execute code if a condition is true.
else/else if statements: Execute code if the condition is false.
for loops: Repeat code block a specified number of times.
while loops: Continue executing code as long as a condition is true.
The basic form of if and if-else statement in R:
Example 1: if and else executes code based on logical conditions.
Example 2: If condition is not met, then nothing happens (skipped).
Example 3: else if checks one more logic condition:
Example 4: There can be multiple else if
Example 5: else is executed when all of if conditions are not met.
[1] "The stock price has decreased."
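A chain of if, else if, and else might look like the sketch below. The variable price_change and its value are assumptions; only the last message is taken from the output above:

```r
price_change <- -2.5   # hypothetical daily change

if (price_change > 0) {
  msg <- "The stock price has increased."
} else if (price_change == 0) {
  msg <- "The stock price is unchanged."
} else {
  # Reached only when none of the conditions above is TRUE
  msg <- "The stock price has decreased."
}
print(msg)
```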
Write an if-else statement:
PMT <- PMT + 10000)
What is the outcome of the above if-else, if the initial PMT was 750?
For loops are used when code has to be iterated a specified number of times.
If the for loop was explicitly written:
for assigns the item in the current environment, overwriting existing variable with the same name.
Items are accessed one by one in vector in for loop:
To use index for each element: use seq_along() on the vector.
[1] 1 2 3 4
When looping over date / times, loops strip the attributes:
As a workaround, use indexing with seq_along() and [[.
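The attribute stripping and the workaround can be seen on a small vector of dates. The dates are arbitrary:

```r
dates <- as.Date(c("2026-01-01", "2026-01-02"))

# Looping over the vector directly yields bare numbers
for (d in dates) {
  print(class(d))
}

# Indexing with seq_along() and [[ keeps the Date class
for (i in seq_along(dates)) {
  print(dates[[i]])
}
```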
Memory Preallocation is creating the full size of the output object before the loop.
For example:
Important tips when looping:
[] instead of c()
Best Example
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
[[6]]
[1] 36
[[7]]
[1] 49
[[8]]
[1] 64
[[9]]
[1] 81
[[10]]
[1] 100
Bad Example
Good Example
If preallocation is cumbersome, use list() for output container then convert to a vector if needed.
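The three patterns can be sketched side by side. Which snippet the slides label "bad", "good", and "best" is an assumption, based on the recommendations about preallocation and c() in this section:

```r
N <- 10

# Bad: grow the vector with c() on every iteration
squares_bad <- c()
for (n in 1:N) {
  squares_bad <- c(squares_bad, n^2)
}

# Good: no preallocation, but assign by index
squares_good <- numeric()
for (n in 1:N) {
  squares_good[[n]] <- n^2
}

# Best: create the full-size container first, then fill it by index
squares_best <- vector("numeric", length = N)
for (n in seq_len(N)) {
  squares_best[[n]] <- n^2
}
print(squares_best)
```

All three give the same values; they differ in how much copying happens along the way.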
N <- 5000
list_prealloc <- vector("list", length = N)
list_noalloc <- list()
vector_prealloc <- vector("numeric", length = N)
vector_noalloc <- numeric()
bench::mark(
list_prealloc_bracket = for (n in 1:N) {
list_prealloc[[n]] <- n**2
},
list_noalloc_bracket = for (n in 1:N) {
list_noalloc[[n]] <- n**2
},
vector_prealloc_bracket = for (n in 1:N) {
vector_prealloc[[n]] <- n**2
},
vector_noalloc_bracket = for (n in 1:N) {
vector_noalloc[[n]] <- n**2
},
list_noalloc_c = for (n in 1:N) {
list_noalloc <- c(list_noalloc, n**2)
},
vector_noalloc_c = for (n in 1:N) {
vector_noalloc <- c(vector_noalloc, n**2)
},
list_prealloc_c = for (n in 1:N) {
list_prealloc <- c(list_prealloc, n**2)
},
vector_prealloc_c = for (n in 1:N) {
vector_prealloc <- c(vector_prealloc, n**2)
},
iterations = 5,
check = FALSE
)
# A tibble: 8 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 list_prealloc_bracket 893µs 926µs 1053. 54.8KB 0
2 list_noalloc_bracket 898µs 941µs 1025. 830.1KB 0
3 vector_prealloc_bracket 809µs 895µs 695. 54.8KB 139.
4 vector_noalloc_bracket 841µs 925µs 1079. 830.1KB 0
5 list_noalloc_c 413ms 639ms 1.53 286.4MB 31.8
6 vector_noalloc_c 127ms 263ms 4.37 286.4MB 92.7
7 list_prealloc_c 357ms 639ms 1.54 286.4MB 32.6
8 vector_prealloc_c 124ms 220ms 4.35 286.4MB 93.1
When performing loops:
Preallocation + bracket indexing [] is the best.
No preallocation is forgivable.
Repeated use of c() is strongly discouraged.
Generally used with if-else condition tests inside loop.
next is used to skip an iteration of loop.
break is used to exit loop immediately.
How to calculate compound interest over multiple years using a for loop?
[1] 10500.00 11025.00 11576.25 12155.06 12762.82 13400.96 14071.00 14774.55
[9] 15513.28 16288.95
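The values above match a principal of 10,000 growing at 5% per year for 10 years. Those inputs are inferred from the output rather than taken from the original code, so treat this as a sketch:

```r
principal <- 10000   # inferred from the output
rate <- 0.05         # inferred from the output
years <- 10

# Preallocate the container, then fill one year at a time
value <- vector("numeric", length = years)
for (y in seq_len(years)) {
  value[[y]] <- principal * (1 + rate)^y
}
print(round(value, 2))
```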
Based on the previous example, do the following:
Q1. Skip the first year using if and next
Q2. Stop the calculation if value exceeds $14,000
A while loop begins by testing a condition, and iterates the code as long as the condition is TRUE.
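A while loop fits problems where the number of iterations is not known in advance. A small sketch with hypothetical inputs: how many years does it take to double 10,000 at 5% per year?

```r
value <- 10000
rate <- 0.05
years <- 0

# The condition is tested before every iteration
while (value < 20000) {
  value <- value * (1 + rate)
  years <- years + 1
}
print(years)
```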
Write code that print 1 to 10 using for loop.
Achieve same result using while loop instead.
a <- 1 outside of the loop
Based on 2, tweak the code so that it skips printing the number if it is 5.
Write a function that checks class of an input.
If the input is numeric, print “Numeric input!”, otherwise, print “Not numeric!”
inherits(x, "numeric") for logical test.
map function from purrr is an implicit function loop.
f is an input arg for map()
for loops
map() requires tidyverse or purrr package
Note
Functions that take other function as inputs are called functionals in R, like map().
Remember, though, if vectorized operation is possible, avoid using for loops or map.
map()
With for loop, code tends to be longer and requires preallocation.
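The contrast can be sketched as follows, assuming the purrr package is installed. The helper times_ten() is hypothetical:

```r
library(purrr)

times_ten <- function(x) x * 10   # hypothetical helper

# map() applies the function to each element and returns a list
as_list <- map(1:3, times_ten)

# map_dbl() does the same but returns a double vector
as_dbl <- map_dbl(1:3, times_ten)

# Equivalent for loop: needs a preallocated container and explicit indexing
out <- vector("numeric", length = 3)
for (i in 1:3) {
  out[[i]] <- times_ten(i)
}
print(as_dbl)
```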
Generate times_two() function that multiplies input by 2.
map times_two function over 1:10
Achieve same result with for loop.
If the desired output is not list but atomic vector:
map_dbl() a numeric (double) vector
map_chr() a character vector
map_lgl() a logical vector
map_int() an integer vector
Most function operations in R are vectorized by default.
for loops or map if vectorization is possible
Example: portfolio value
A for loop approach:
A map approach:
* function to two input vectors (price, share)
map2() for this case, see ?map2 for more info
bench::mark(
vectorizing = {portfolio_value <- stock_prices * shares_held},
map2 = {portfolio_value <- map2(stock_prices, shares_held, `*`)},
for_loop_prealloc = {
portfolio_value <- vector("numeric", N) # container
for (i in 1:N) {
portfolio_value[[i]] <- stock_prices[[i]] * shares_held[[i]]
}},
iterations = 100,
check = FALSE
)
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 vectorizing 41ns 82ns 4533503. 0B 0
2 map2 46µs 53.51µs 18094. 264B 0
3 for_loop_prealloc 887µs 1.01ms 983. 20.1KB 30.4
Generate portfolio value of each asset, using:
ifelse() function is a vectorized if else statement.
Example: Dividend Payments
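A sketch of ifelse() on a whole vector at once. The earnings figures and labels are made up for illustration:

```r
earnings_per_share <- c(2.1, -0.4, 0.0, 1.3)   # hypothetical data

# test, value if TRUE, value if FALSE: evaluated element by element
dividend_policy <- ifelse(earnings_per_share > 0, "Pay dividend", "No dividend")
print(dividend_policy)
```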
Practice ifelse():
sentiment object.
case_when() from tidyverse package is a general vectorized if-else.
[1] "Bear" "Bull" "Weak" "Bear" "Normal" "Bear"
Practice case_when():
sentiment2 object.
fs package provides simple and consistent way to:
A path is a string of characters used to uniquely identify a file or folder in a file system.
Types of paths:
The working directory is the location where the program (R, bash, Python, etc.) is currently running.
getwd() shows the current working directory.
setwd("/path/to/directory") changes the working directory to the specified path.
Absolute path: starts from the root directory (/ in Mac/Linux, C:\ in Windows)
/Users/username/Documents/project/data.csv
C:\Users\username\Documents\project\data.csv
Absolute paths are unambiguous.
Relative path: starts from the current working directory
./data/project/data.csv (The . denotes the current directory)
.: the current directory.
If the current directory is /Users/john/projects, ./data refers to /Users/john/projects/data.
..: the parent directory; one level up from the current directory.
If the current directory is /Users/john/projects, ../data refers to /Users/john/data.
~: the home directory.
The “default” directory for a user in the operating system
Mac: /Users/<username> (Linux: /home/<username>). Example: If your username is john, your home directory would be /Users/john.
Windows: C:\Users\<username>. Example: If your username is john, your home directory would be C:\Users\john
~ represents the user’s home directory.
Example: ~/cases refers to the cases folder in the user’s home directory:
C:\Users\<username>\cases for Windows
/Users/john/cases for Mac
Creating and deleting files and directories is simple:
List files and directories:
It’s especially useful with globbing / regex:
fs_exercise.R on your working directory.
List all files that have the .R file extension.
What is the absolute path of the script file?
From your home(~), what is the relative path of the script file?
A plain, human-readable text data file, delimited by a specific character:
, (comma) for CSV
\t (tab) for TSV
A text data file typically looks like:
(values separated by a delimiter: , for CSV, \t for TSV)
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
Many packages support writing/reading csv/tsv files:
Base R (utils package): basic, slow
readr from tidyverse: extremely fast, functional
To write a data.frame to a csv file: write_csv()
To read a .csv / .tsv file to a data.frame: read_csv(), read_tsv()
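A write-then-read round trip can be sketched with the base R (utils) functions, so it runs without extra packages. This swaps in write.csv() and read.csv() for readr's write_csv() and read_csv(), which are used the same way on a path. The temporary file is only for illustration:

```r
path <- tempfile(fileext = ".csv")

# Write the first three rows of iris as comma-delimited text
write.csv(head(iris, 3), path, row.names = FALSE)

# The file is plain text: a header line followed by one line per row
print(readLines(path))

# Read it back into a data.frame
iris_back <- read.csv(path)
print(iris_back)

unlink(path)   # clean up the temporary file
```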
There are other common data formats:
One of the most important data classes in R, built on top of the list type.
Stores data structure in 2D tabular form:
with rows (observations, or records)
and columns (variables)
Columns can be different types!
Creating a data.frame is almost identical to list.
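The similarity to list() shows in a small sketch. The bonds table and its columns are hypothetical:

```r
# Same name = value syntax as list(), but every column must have the same length
bonds <- data.frame(
  issuer = c("A Corp", "B Corp", "C Corp"),
  coupon = c(0.03, 0.045, 0.05),
  callable = c(TRUE, FALSE, TRUE)
)

print(typeof(bonds))   # built on a list
print(class(bonds))    # with the data.frame class on top

# Each column keeps its own type
column_classes <- sapply(bonds, class)
print(column_classes)
```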
Create a dataframe named as housing:
What are the type (class) of each column automatically recognized by R?
str(housing).
Q: What should be their type (class) in theory?
Essentially the same as dataframe class, with some fix:
as_tibble() converts data.frame class to tibble class.
A toy dataset, iris dataframe:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Class of iris:
Convert iris to tibble class:
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
iris_tb is a multi-class object that is both tibble and dataframe.
As it is built on lists, [, [[, $ also work on data.frames.
tibble class
Dataframe can be subsetted with df[i,j]
i part operates on row (called filtering)
j part selects columns
[, [[ subsetting
1st row, 1st column, in element class (numeric vector)
Caution
Double bracket [[ works with , only when both row and columns are mentioned. That is, iris_tb[[1,1]] works, but iris_tb[[,1]] doesn’t.
To pull in element’s (vector) class, you’ll learn pull().
Other examples:
If comma is not provided, it assumes a column index.
$ subsetting
$ pulls a single column in element’s class from data frame (tibble).
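The subsetting forms can be tried on the built-in iris data. This sketch uses the plain data.frame rather than iris_tb so it needs no extra packages. Note that a tibble's single bracket always returns a tibble, while a data.frame may drop to a vector:

```r
one_column_df <- iris[2]          # single bracket, no comma: a data.frame with one column
one_column_vec <- iris[[2]]       # double bracket: the column as a numeric vector
same_column <- iris$Sepal.Width   # $ pulls the same column by name

single_value <- iris[[1, 1]]      # 1st row, 1st column, as a bare value

# Rows 1 to 3, two named columns
block <- iris[1:3, c("Sepal.Length", "Species")]
print(block)
```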
Generate iris_tb by converting iris using as_tibble().
Exercise all subsetting methods on rows in the 2nd column:
iris_tb[2]
iris_tb[[2]]
iris_tb[colname]
iris_tb[[colname]]
$
nrow() to check the number of rows.
Subsetting can be chained
iris_tb[45, c("Petal.Length", "Sepal.Length")]
On iris_tb,
filtered_iris
Q. Confirm the average of Petal.Length from filtered_iris is 4.74.
mean(dataframe$column).
They operate the same way as lists. To assign a new variable within the data.frame, use:
To remove a variable from the data.frame, use:
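Adding and removing columns works like adding and removing list components. A sketch on a small copy of iris, where the new column names are made up:

```r
df <- head(iris, 3)

# Add new variables with $ or [[ and assignment
df$Sepal.Ratio <- df$Sepal.Length / df$Sepal.Width
df[["Petal.Area"]] <- df$Petal.Length * df$Petal.Width

# Remove a variable by assigning NULL
df$Species <- NULL

print(names(df))
```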
R package for dataframe manipulation tasks.
[, [[, $ in most cases
Company_financials.csv data will be available in our GitHub Class repository.
select() verb
select() lets you choose specific columns.
Suppose you want to select all columns that start with “current”.
or end with “liabilities”.
or contain “asset”
relocate() verb
relocate() is used to change the order of columns.
rename() verb
rename() changes the column names.
pull() verb
pull() extracts a single column as a vector.
Using the dataset fin_data:
Create new tibble that includes only “ticker”, “Industry”, “year”, “market_cap” and column that starts with “current”.
Relocate column “market_cap” as the first column.
Rename “market_cap” to “Market_Cap” column.
Pull “ticker” as a vector from fin_data.
Combine 1 to 4 with pipe chain to achieve all at once.
filter() verb
filter() lets you choose rows based on conditions.
Apply filters across columns: if_any() and if_all()
Imagine your dataset includes multiple asset columns (e.g., current assets and current liabilities). You want to filter rows where any asset value exceeds $100B (1e11).
if_all() for strict filtering:
slice() verb
slice() extracts rows based on simple positions.
distinct() verb
distinct() removes duplicate rows based on referred columns.
From fin_data:
Filter that contains only rows where year is greater than 2022 and Industry is “Financials”.
Filter rows where any of columns that contains “asset” exceeds $100B (1e11).
Filter rows where all of columns that contains “current” exceed $10B (1e10).
Slice first 3 rows of the data.
Show distinct values of “Industry” in the data, and keep other columns.
arrange() verb
arrange() reorders the rows by one or more columns.
Caution
Best practice: DO NOT CHAIN ARRANGE - it resets reordering.
mutate() verb
mutate() lets you create or modify columns.
Using general if-else with case_when() to classify:
From fin_data:
Arrange the data by ticker (ascending) and year.
Create new variable “debt_to_asset_ratio” as the ratio of current_debt to current_assets.
summarize() verb
summarize() computes statistics for the entire dataset.
You can summarize by groups:
Or simply use .by in the summarize()
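Several verbs can be chained into one pipeline. The sketch below assumes dplyr 1.1.0 or later (for .by) and uses a tiny made-up table, fin_toy, instead of the class dataset. The column net_current is hypothetical:

```r
library(dplyr)

# Toy stand-in for the class dataset (all values are made up)
fin_toy <- tibble(
  ticker = c("AAA", "AAA", "BBB", "BBB"),
  Industry = c("Financials", "Financials", "Energy", "Energy"),
  year = c(2022, 2023, 2022, 2023),
  current_assets = c(100, 120, 50, 40),
  current_debt = c(30, 30, 20, 10)
)

summary_tbl <- fin_toy |>
  select(ticker, Industry, year, starts_with("current")) |>  # choose columns
  filter(year == 2023) |>                                    # choose rows
  mutate(net_current = current_assets - current_debt) |>     # add a column
  summarize(avg_net_current = mean(net_current), .by = Industry)  # one row per group

print(summary_tbl)
```

Replacing .by = Industry with group_by(Industry) before summarize() gives the same grouping.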
From fin_data:
Arrange the dataset by ticker and year in ascending order. Then, group the data by ticker.
Use mutate() along with the lag() function to calculate the year-over-year growth rate for current_assets. Name the variable as yearly_asset_growth.
\(\frac{\mathrm{current\_assets} - \mathrm{lag(current\_assets)}}{\mathrm{lag(current\_assets)}}\)
Key verbs
select() : select subset of columns
rename() : rename columns
relocate() : change column positions
pull() : extract single column as vector
filter() : select subset of rows with condition
slice() : extract specific rows
distinct() : remove duplicate rows
arrange() : reorder rows
mutate() : add new columns (variables)
summarize() : generate summary table
group_by() / ungroup()
A mini finance project
From now on, I’ll introduce how to leverage AI for coding.
Generating code snippets
Troubleshoot and debug
Best Practices
dplyr and tidyverse
Learn to:
Let’s optimize our crypto investments!
Our list of cryptos: 9 sample
How many observations are found for each crypto?
Calculate daily returns with arrange(), group_by() and mutate()
# A tibble: 6 × 9
symbol date open high low close volume adjusted daily_ret
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ADA-USD 2020-01-01 0.0328 0.0338 0.0327 0.0335 22948374 0.0335 NA
2 ADA-USD 2020-01-02 0.0335 0.0335 0.0324 0.0328 20843934 0.0328 -0.0211
3 ADA-USD 2020-01-03 0.0327 0.0344 0.0325 0.0342 30162644 0.0342 0.0436
4 ADA-USD 2020-01-04 0.0342 0.0347 0.0339 0.0346 29535781 0.0346 0.0121
5 ADA-USD 2020-01-05 0.0346 0.0354 0.0345 0.0347 21479178 0.0347 0.00364
6 ADA-USD 2020-01-06 0.0348 0.0373 0.0347 0.0373 37988444 0.0373 0.0735
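The daily_ret column above is consistent with a simple return computed within each symbol. The original code is not shown, so the formula (adjusted price over its previous value, minus one) is an assumption, and the toy data below is made up:

```r
library(dplyr)

# Toy prices, deliberately out of order to show why arrange() comes first
toy <- tibble(
  symbol = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  date = as.Date("2020-01-01") + c(1, 0, 2, 0, 1),
  adjusted = c(110, 100, 99, 50, 55)
)

toy_ret <- toy |>
  arrange(symbol, date) |>                             # sort within each symbol
  group_by(symbol) |>                                  # lag() must not cross symbols
  mutate(daily_ret = adjusted / lag(adjusted) - 1) |>  # first day of each symbol is NA
  ungroup()

print(toy_ret)
```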
Calculate performance metrics with group_by() and summarize()
# A tibble: 9 × 3
symbol avg_daily_ret vol_daily
<chr> <dbl> <dbl>
1 ADA-USD 0.00356 0.0588
2 BNB-USD 0.00421 0.0570
3 BTC-USD 0.00150 0.0379
4 DOGE-USD 0.00807 0.135
5 DOT-USD 0.00270 0.0678
6 ETH-USD 0.00333 0.0505
7 MATIC-USD 0.00668 0.0805
8 SOL-USD 0.00548 0.0793
9 XRP-USD 0.00247 0.0634
Visualize performance metrics with ggplot(), to generate barplot:
performance_metrics |>
ggplot(
aes(x = fct_reorder(symbol, -avg_daily_ret), y = avg_daily_ret, fill = symbol )
) +
geom_col() +
scale_y_continuous(labels = scales::percent_format())+
labs(
title = "Average Daily Return of Cryptos",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
y = "Average Return",
fill = "Symbol"
) +
theme_minimal()
Similarly for volatility:
performance_metrics |>
ggplot(
aes(x = fct_reorder(symbol, vol_daily), y = vol_daily, fill = symbol )
) +
geom_col() +
scale_y_continuous(labels = scales::percent_format())+
labs(
title = "Daily Volatility of Cryptos",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
y = "Daily Volatility",
fill = "Symbol"
) +
theme_minimal()
To combine and juxtapose (simple):
performance_metrics |>
mutate(symbol = fct_reorder(symbol, desc(avg_daily_ret))) |>
pivot_longer(cols = !symbol) |> # make long form
ggplot(
aes(x = symbol, y = value, fill = name)
) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Average Daily Return / Volatility of Cryptos",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
y = "Average Return / Volatility (%)",
fill = "Metric"
)
To combine and juxtapose (advanced):
# Since return is smaller: scale by their max values
scale_factor <- max(performance_metrics$avg_daily_ret) / max(performance_metrics$vol_daily)
performance_metrics |>
ggplot(aes(x = fct_reorder(symbol, -avg_daily_ret))) +
geom_col(
aes(y = avg_daily_ret, fill = "Average Return"),
position = position_nudge(x=-0.2), # move to left
width = 0.4
) +
geom_col(
aes(y = vol_daily * scale_factor, fill = "Volatility"), # notice the scale factor
position = position_nudge(x=0.2), # move to right
width = 0.4) +
scale_y_continuous(
name = "Average Return (%)",
labels = scales::percent_format(),
sec.axis = sec_axis(
\(x) x / scale_factor,
name = "Volatility (%)",
labels = scales::percent_format())
) +
labs(
title = "Average Return and Volatility, Dual Axis",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
fill = "Metric"
) +
theme_bw()
A simple univariate portfolio sorting to see if “trading volume” predicts future returns.
# A tibble: 6 × 3
symbol avg_daily_volume volume_rank
<chr> <dbl> <int>
1 ADA-USD 1874739609. 3
2 BNB-USD 1554934937. 2
3 BTC-USD 36702325339. 5
4 DOGE-USD 1712255288. 3
5 DOT-USD 1392992622. 2
6 ETH-USD 18924519134. 4
Let’s test if volume explains future crypto returns:
Join rank (from past observation) to future crypto data using “symbol” as key
# A tibble: 6 × 11
symbol date open high low close volume adjusted daily_ret
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ADA-USD 2023-01-02 0.250 0.256 0.247 0.254 159328803 0.254 NA
2 ADA-USD 2023-01-03 0.254 0.255 0.251 0.253 153555529 0.253 -0.00407
3 ADA-USD 2023-01-04 0.253 0.270 0.252 0.268 289945179 0.268 0.0589
4 ADA-USD 2023-01-05 0.268 0.270 0.264 0.269 175511469 0.269 0.00532
5 ADA-USD 2023-01-06 0.269 0.279 0.268 0.279 326480796 0.279 0.0355
6 ADA-USD 2023-01-07 0.279 0.280 0.273 0.277 166488086 0.277 -0.00555
# ℹ 2 more variables: avg_daily_volume <dbl>, volume_rank <int>
Generate average crypto daily return by volume rank:
# A tibble: 5 × 2
volume_rank avg_daily_ret_by_volume_sorting
<int> <dbl>
1 1 0.0228
2 2 0.0109
3 3 0.0126
4 4 0.00829
5 5 0.0121
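The sort-join-average steps above can be sketched on toy data. This is a minimal sketch, not the original code: the symbols, returns, the use of `ntile()` for ranking, and the two-bucket split are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical past data: one average volume per symbol
past <- tibble(
  symbol = c("A", "B", "C", "D"),
  avg_daily_volume = c(10, 40, 20, 30)
) |>
  mutate(volume_rank = ntile(avg_daily_volume, 2))  # 1 = low volume, 2 = high volume

# Hypothetical future daily returns for the same symbols
future <- tibble(
  symbol = rep(c("A", "B", "C", "D"), each = 2),
  daily_ret = c(0.01, 0.03, -0.01, 0.01, 0.02, 0.02, 0.00, 0.02)
)

# Join the past rank onto future data by symbol, then average by rank
portfolio <- future |>
  left_join(past, by = "symbol") |>
  group_by(volume_rank) |>
  summarise(avg_ret = mean(daily_ret, na.rm = TRUE))

portfolio
```

The rank is computed only from past data, so the future returns never influence the sorting.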
Visualize (bar plot): basic plot
Problems:
Finalizing plot:
portfolio_analysis |>
mutate(volume_rank = as.factor(volume_rank)) |>
ggplot(aes(x = volume_rank, y = avg_daily_ret_by_volume_sorting, fill = volume_rank)) +
geom_col() +
labs(
title = "Average Daily Crypto return by Volume Sorting",
subtitle = "January 2023 ",
caption = "Quintile Volume Sorting from 2020-2022 Data",
x = "Volume Rank (1 Low 5 High Volume)",
y = "Average Daily Return (%)",
fill = "Volume Rank"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer(palette = "Set1") +
theme_bw()
Notes
This analysis is for demonstration only. For more rigor, consider:
Factors represent categorical variables that contain a fixed and known set of possible values.
They’re useful when you want to display character vectors in a specific, non-alphabetical order.
Note
I introduce forcats::fct() instead of base R’s factor(), because fct() has stricter and more predictable behavior.
Factors solve two common problems with character vectors:
Typos and invalid entries: Factors restrict inputs to predefined categories.
Sorting: Factors can sort according to a custom order, rather than alphabetically.
Use the forcats::fct() function from the forcats package (part of the tidyverse):
Sorting a factor follows the order in which its levels were defined:
If a value that is not in the levels appears, forcats::fct() raises an error:
If levels are not specified, fct() keeps the order in which values first appear in the input:
[1] "USA" "Canada" "South Korea"
Tip
Base R’s factor() doesn’t behave like this; it sorts levels alphabetically, which is discouraged.
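A minimal sketch of the behaviors described above, assuming an illustrative vector of country names (the misspelled entry is deliberate):

```r
library(forcats)

countries <- c("USA", "Canada", "South Korea", "USA")  # illustrative input

f <- fct(countries)
levels(f)                   # levels in order of first appearance
levels(factor(countries))   # base R: alphabetical order

# A value outside the declared levels raises an error instead of becoming NA
bad <- tryCatch(
  fct(c("USA", "Cnada"), levels = c("USA", "Canada")),
  error = function(e) "error"
)
bad
```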
Levels are a very important attribute of factor objects.
To browse or modify the levels attribute, use levels(). If levels change, the corresponding values are recoded.
[1] "AAA" "AA" "A" "BBB" "BB" "B" "CCC"
[1] 4 2 3 7
Levels: 1 2 3 4 5 6 7
AAA becomes 1, AA becomes 2, and so on, preserving the order.
A convenient and safer way to recode is to use fct_recode()
rating_levels <- c(
"AAA", "AA", "A",
"BBB", "BB", "B", "CCC") # possible values
bond_ratings <- fct(
c("AAA","BBB", "A", "BB", "AA"),
levels = rating_levels)
bond_ratings <- fct_recode(bond_ratings,
"Top Tier" = "AAA",
"High Grade" = "AA",
"Medium Grade" = "A",
"Lower Grade" = "BBB",
"Speculative" = "BB"
)
bond_ratings # B and CCC levels remain
[1] Top Tier Lower Grade Medium Grade Speculative High Grade
Levels: Top Tier High Grade Medium Grade Lower Grade Speculative B CCC
Use fct_collapse() to combine multiple levels:
[1] Investment Investment Investment Speculative Speculative Speculative
Levels: Investment Speculative
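A sketch of fct_collapse() that produces output of this shape. The input ratings and the investment/speculative split at BBB are assumptions for illustration.

```r
library(forcats)

rating_levels <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC")
ratings <- fct(c("AAA", "AA", "BBB", "BB", "B", "CCC"), levels = rating_levels)

# Each new level on the left absorbs the old levels listed on the right
grade <- fct_collapse(
  ratings,
  Investment  = c("AAA", "AA", "A", "BBB"),
  Speculative = c("BB", "B", "CCC")
)
grade
```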
You can reorder levels with fct_relevel(). Any levels not mentioned are left in their existing order, after the explicitly mentioned levels.
stock_returns <- tibble(
Ticker = fct(c("AAPL", "MSFT", "GOOG", "JPM", "BAC")),
Sector = fct(c("Technology", "Technology", "Technology", "Financial", "Financial")),
Return = c(0.12, 0.08, 0.10, 0.05, 0.04)
)
stock_returns |>
mutate(Reordered_Ticker = fct_relevel(Ticker, "GOOG")) |>
pull(Reordered_Ticker)
[1] AAPL MSFT GOOG JPM BAC
Levels: GOOG AAPL MSFT JPM BAC
Use fct_reorder(f, x) to reorder factor levels according to x. It doesn’t change the position of the actual values, only the levels!
[1] AAPL MSFT GOOG JPM BAC
Levels: BAC JPM MSFT GOOG AAPL
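A sketch using the tickers and returns from the stock_returns example, written with plain vectors so it runs on its own:

```r
library(forcats)

ticker <- fct(c("AAPL", "MSFT", "GOOG", "JPM", "BAC"))
ret    <- c(0.12, 0.08, 0.10, 0.05, 0.04)

# Levels are sorted by ascending return; the values keep their positions
reordered <- fct_reorder(ticker, ret)
as.character(reordered)
levels(reordered)
```

This is why fct_reorder() is handy inside ggplot(aes()): bars are drawn in level order, not in row order.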
fct_other() lumps levels into an “Other” category, based on the levels you choose to keep or drop.
[1] AAA AA A BBB Other Other Other AA Other Other BBB Other
[13] Other A BBB BBB
Levels: A AA AAA BBB Other
fct_lump() lumps together infrequent levels into an “Other” category, by n or prop.
[1] Other AA A BBB Other B CCC AA CCC B BBB Other
[13] Other A BBB BBB
Levels: A AA B BBB CCC Other
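The difference between the two can be sketched on a small, hypothetical sector vector. The sketch uses fct_lump_n(), the n-based member of the fct_lump() family in forcats.

```r
library(forcats)

sector <- fct(c("Tech", "Tech", "Tech", "Fin", "Fin", "Energy", "Util"))

# fct_other(): you name the levels to keep; everything else becomes "Other"
kept <- fct_other(sector, keep = c("Tech", "Fin"))

# fct_lump_n(): keep the n most frequent levels; the rest become "Other"
lumped <- fct_lump_n(sector, n = 2)

kept
lumped
```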
The fct_anon() function replaces the existing levels with anonymous (generic) labels.
Create a factor market_regime from the vector c("Bear", "Sideways", "Bull") such that the order is Bull, Sideways, then Bear.
Then, recode the levels to “Downturn”, “Flat”, and “Upturn”
Use fct_recode() to change the names.
Given the following tibble of regional sales data:
Reorder the region factor based on avg_sales in ascending order and create a bar plot showing average sales by region.
Use fct_reorder(region, avg_sales) inside ggplot(aes())
You have a factor investment_style with the following values:
Collapse the factor into two groups using fct_collapse()
Traditional: includes “Growth”, “Value”, “Blend”
Alternative: includes “Contrarian”, “Speculative”
Given a vector of currency codes:
Use fct_other() to lump together any currency other than “USD” as “Other” category.
Use fct_lump() to lump infrequent currency as “Other”, using n or prop
Logical vectors contain three possible values: TRUE, FALSE, and NA. They are used extensively in data filtering, comparisons, and conditional transformations.
NA itself is logical, but other atomic vectors (integer, double, character) can also contain missing values, so there is a corresponding NA for each type:
NA_integer_ for integer
NA_real_ for double
NA_character_ for character
R handles the type conversion automatically when needed, so users don’t need to use it manually.
Q1. What is TRUE & FALSE?
Q2. What is TRUE | FALSE?
Q3. What is TRUE & TRUE?
Q4. What is FALSE | FALSE?
Q5. What is TRUE | TRUE?
Q6. What is FALSE & FALSE?
NA represents missing data
Comparisons with NA return NA
Use is.na() to check for missing values.
Keep in mind the logic:
TRUE | anything is TRUE, even when the other side is NA
FALSE & anything is FALSE, even when the other side is NA
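A short base R sketch of these rules, using an illustrative numeric vector:

```r
x <- c(5, NA, 12)

x > 10        # FALSE NA TRUE: a comparison with NA is NA
is.na(x)      # FALSE TRUE FALSE
NA == NA      # NA: two unknown values may or may not be equal

TRUE | NA     # TRUE: the result does not depend on the unknown value
FALSE & NA    # FALSE: same reasoning
TRUE & NA     # NA: the result depends on the unknown value
```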
Guess the results:
Comparison operators: <, <=, >, >=, !=, ==
The modulo operator (%%) is very useful for testing the divisibility of numbers
[1] 2 4 6 8 10
[1] 1 3 5 7 9
New: Exclusive OR: xor
[1] FALSE FALSE FALSE TRUE FALSE
[1] TRUE FALSE FALSE TRUE TRUE
[1] TRUE TRUE TRUE FALSE FALSE
[1] TRUE TRUE TRUE FALSE TRUE
&: Element-wise AND
|: Element-wise OR
!: Negation
%in% Operator
Checks whether an element is found in another set.
&& and || are short-circuit operators
dplyr functions!
Checking equality with == on numeric (real) values is discouraged:
When checking with ==:
Why?
1/49 or sqrt(2) with fixed numbers with decimal places
That’s why == was failing.
To compare real numbers, use dplyr::near() function.
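A minimal sketch of the problem and the fix, using the 1/49 and sqrt(2) values mentioned above:

```r
library(dplyr)

x <- c(1 / 49 * 49, sqrt(2)^2)

x == c(1, 2)        # exact comparison fails because of floating-point rounding
print(x, digits = 16)

near(x, c(1, 2))    # comparison within a small tolerance succeeds
```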
There are two logical summaries: any() and all()
any(x) is the equivalent of |
TRUE if any of x is TRUE, even when x includes NA
all(x) is the equivalent of &
TRUE if all of x are TRUE
FALSE if any of x is FALSE, even when x includes NA
Check below:
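A base R sketch of how any() and all() treat NA:

```r
any(c(FALSE, NA, TRUE))        # TRUE: one TRUE is enough
any(c(FALSE, NA))              # NA: the missing value could be TRUE
all(c(TRUE, NA, FALSE))        # FALSE: one FALSE is enough
all(c(TRUE, NA))               # NA: the missing value could be FALSE

# na.rm = TRUE drops missing values before summarizing
any(c(FALSE, NA), na.rm = TRUE)
all(c(TRUE, NA), na.rm = TRUE)
```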
> on the vector 1:10.
%%)
Explain the result of TRUE & NA. What does this tell you about logical operations involving NA?
Suppose you have a vector of company tickers:
and portfolio:
Write code to determine which tickers in tickers are present in portfolio. Yield a logical vector.
?any() and ?all()
Numeric vectors are the backbone of financial data. Numerics include:
We will use tidyverse verbs to manipulate numeric data in real-world finance examples.
Sometimes numbers are stored as strings (characters), especially when data was imported from external sources.
parse_double() converts strings that are purely numeric
parse_number() extracts numeric parts from strings
parse_double() example:
parse_number() example:
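A sketch of both parsers on illustrative strings (the price strings are made up):

```r
library(readr)

# parse_double(): the whole string must be a number
clean <- parse_double(c("1.2", "5.6", "1e3"))

# parse_number(): ignores surrounding text, currency symbols, and grouping marks
messy <- parse_number(c("USD 3,513", "59%", "$1,234"))

clean
messy
```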
These functions compare values element-wise (rowwise in tibble).
Modular arithmetic is useful for breaking down composite numbers.
%/% integer division (quotient)
%% modulo operator (remainder)
For example, convert a time value in HHMM format to hours and minutes:
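A sketch with illustrative HHMM values:

```r
time_hhmm <- c(930, 1345, 1600)   # illustrative times in HHMM format

hours   <- time_hhmm %/% 100   # integer quotient: 9 13 16
minutes <- time_hhmm %% 100    # remainder: 30 45 0

data.frame(time_hhmm, hours, minutes)
```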
In finance, logarithmic returns are often used. In R, log() is the natural log; log2() and log10() use base 2 and base 10.
Logarithmic (Log) Returns
Calculated as the natural logarithm of the ratio of consecutive prices:
\(r_{log} = \ln\left(\frac{P_t}{P_{t-1}}\right)\)
Log returns are additive over time, which makes cumulative calculations more straightforward.
Key Differences between arithmetic and logarithmic returns:
Additivity
Log returns can be summed to get the cumulative return; arithmetic returns must be compounded
Approximation
For small returns, log returns are very similar to arithmetic returns, but the difference becomes significant for larger returns
Returns comparison
Let’s see how to compute cumulative returns using both methods.
Cumulative Returns
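A minimal sketch of both cumulative calculations on a hypothetical price path. The prices are made up; both methods should recover the same total return of 10%.

```r
prices <- c(100, 105, 99, 110)   # hypothetical prices

arith_ret <- prices[-1] / prices[-length(prices)] - 1   # P_t / P_{t-1} - 1
log_ret   <- diff(log(prices))                          # ln(P_t / P_{t-1})

# Arithmetic returns are compounded; log returns are summed
cum_arith <- prod(1 + arith_ret) - 1
cum_log   <- sum(log_ret)

cum_arith          # total arithmetic return
exp(cum_log) - 1   # converting the summed log return back gives the same number
```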
Rounding is key for reporting. Use round(), floor(), and ceiling().
cut() bins numeric values into discrete intervals with custom breaks.
dplyr::lag() and dplyr::lead() let you refer to the values just before or after the current one.
Extract positions: first(), last(), nth()
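A sketch of the offset and position helpers on an illustrative price vector:

```r
library(dplyr)

price <- c(100, 102, 101, 105)   # illustrative prices

lag(price)    # previous value: NA 100 102 101
lead(price)   # next value: 102 101 105 NA

# A simple return uses the previous price as the base
ret <- price / lag(price) - 1

first(price)
last(price)
nth(price, 2)
```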
For the price:
Round price to
Parse below character vector of prices properly:
In finance, tracking dates and times is critical for modeling transactions, trade dates, settlement dates, and market events.
Although dates and times seem straightforward, they involve complexities such as:
There are three types:
<date>
<dttm>, also referred to as “POSIXct”
<time> from hms
R doesn’t have a native class for time, but the tidyverse (hms) offers one.
today() and now() create date and datetime class vectors.
If external data has a standard (i.e., ISO 8601) date or datetime, read_csv() will parse it automatically.
If external data has an ambiguous format, you can specify the format manually.
| Type | Code | Meaning | Example |
|---|---|---|---|
| Year | %Y | 4 digit year | 2021 |
| | %y | 2 digit year | 21 |
| Month | %m | Number | 2 |
| | %b | Abbreviated name | Feb |
| | %B | Full name | February |
| Day | %d | One or two digits | 2 |
| Time | %H | 24-hour hour | 13 |
| | %M | Minutes | 35 |
| | %S | Seconds | 45 |
| | %I | 12-hour hour | 1 |
| | %p | AM/PM | pm |
| | %Z | Time zone name | America/Chicago |
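A sketch of how the format codes resolve an ambiguous string, using readr's parsers. The input strings are illustrative.

```r
library(readr)

# The same string means different dates under different formats
us_style <- parse_date("01/02/15", format = "%m/%d/%y")   # January 2, 2015
eu_style <- parse_date("01/02/15", format = "%d/%m/%y")   # February 1, 2015

# Date and time codes can be combined
stamp <- parse_datetime("2023-05-15 13:35:45", format = "%Y-%m-%d %H:%M:%S")

us_style
eu_style
stamp
```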
Guess the correct Date/Time Format:
Some cases are not handled perfectly by datetime format such as:
The lubridate package has nice handlers for those cases.
The lubridate package has nice handlers for datetimes as well. The time zone must be specified correctly.
trade_datetime_24 <- "2023-05-15 09:30:00"
trade_datetime_12 <- "May 15, 2023 09:30 AM"
trade_datetime_24_tz <- "2023-05-15 09:30:00 EST"
ymd_hms(trade_datetime_24) # UTC by default
ymd_hms(trade_datetime_24, tz = "EST") # set time zone at EST
mdy_hm(trade_datetime_12)
ymd_hms(trade_datetime_24_tz) # time zone should be mentioned
Time zones are not just formatting. They change the underlying values, especially when a datetime is parsed from strings.
If you’re American you’ll know “EST” as Eastern Standard Time, but both Australia and Canada also have EST!
R uses the international standard IANA time zones, written as {area}/{location}.
There are two scenarios that you want to change time zones:
with_tz() keeps the instant but changes the time zone.
For the following datetime, change the time zone to Chicago, keeping the instant.
force_tz() keeps the clock time but changes the instant.
For the following datetime, change the time zone to Chicago, keeping the clock time.
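A sketch contrasting the two functions, assuming a New York market-open timestamp as the starting point:

```r
library(lubridate)

ny_open <- ymd_hms("2023-05-15 09:30:00", tz = "America/New_York")

# Same moment in time, displayed on a Chicago clock (08:30)
same_instant <- with_tz(ny_open, "America/Chicago")

# Same clock reading (09:30), but now a different moment in time
same_clock <- force_tz(ny_open, "America/Chicago")

same_instant
same_clock
difftime(same_clock, ny_open, units = "hours")   # the instant moved by one hour
```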
You can pull out individual parts of the date with the accessor functions.
What is the
of example_datetime?
In finance, flooring a date or time is often used to match frequencies and to pick the most relevant information at a specific time.
floor_date(), ceiling_date() and round_date()
last_traded_time <- ymd_hms("2024-09-08 13:33:45.653 EST", tz = "EST") # milliseconds
floor_date(last_traded_time) # by default, second
ceiling_date(last_traded_time)
floor_date(last_traded_time, unit = "10 seconds")
floor_date(last_traded_time, unit = "15 mins")
floor_date(last_traded_time, unit = "2 hours")
America/New_York
Missing values frequently appear in financial datasets.
Two types of missingness:
Explicit missing: values marked NA
Implicit missing: rows that should be present but are absent
| company | year | quarter | revenue |
|---|---|---|---|
| AAPL | 2020 | 1 | 100 |
| AAPL | 2020 | 2 | NA |
| AAPL | 2020 | 3 | 110 |
| AAPL | 2021 | 1 | 200 |
| TSLA | 2021 | 1 | 210 |
| TSLA | 2021 | 2 | 220 |
Explicit missing:
NA values on the revenue
Implicit missing:
Generally, you want to reveal those implicit missing cases as explicit. tidyr::complete() is handy for this operation.
Note
tidyr is included in tidyverse.
# A tibble: 12 × 4
company year quarter revenue
<chr> <dbl> <dbl> <dbl>
1 AAPL 2020 1 100
2 AAPL 2020 2 NA
3 AAPL 2020 3 110
4 AAPL 2021 1 200
5 AAPL 2021 2 NA
6 AAPL 2021 3 NA
7 TSLA 2020 1 NA
8 TSLA 2020 2 NA
9 TSLA 2020 3 NA
10 TSLA 2021 1 210
11 TSLA 2021 2 220
12 TSLA 2021 3 NA
Since Q4 is missing for every company and year, complete() cannot make every missing value explicit.
In this case, you can provide your own values.
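A sketch of both calls on the revenue table shown earlier. The tibble name `revenue_tbl` is an assumption.

```r
library(tidyr)
library(tibble)

revenue_tbl <- tribble(
  ~company, ~year, ~quarter, ~revenue,
  "AAPL",   2020,  1,        100,
  "AAPL",   2020,  2,        NA,
  "AAPL",   2020,  3,        110,
  "AAPL",   2021,  1,        200,
  "TSLA",   2021,  1,        210,
  "TSLA",   2021,  2,        220
)

# Only combinations of values that occur in the data: quarters 1 to 3
observed_only <- revenue_tbl |> complete(company, year, quarter)

# Supplying the full set of quarters also reveals the missing Q4 rows
all_quarters <- revenue_tbl |> complete(company, year, quarter = 1:4)

nrow(observed_only)
nrow(all_quarters)
```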
There are roughly three methods to handle missing values in finance:
NA to 0
NA
NA with incremental values in between
Heuristic approach where you simply know (or assume) NA values. ifelse() is a useful technique.
# A tibble: 16 × 5
company year quarter revenue revenue_filled
<chr> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2020 1 100 100
2 AAPL 2020 2 NA 100
3 AAPL 2020 3 110 110
4 AAPL 2020 4 NA 100
5 AAPL 2021 1 200 200
6 AAPL 2021 2 NA 100
7 AAPL 2021 3 NA 100
8 AAPL 2021 4 NA 100
9 TSLA 2020 1 NA 100
10 TSLA 2020 2 NA 100
11 TSLA 2020 3 NA 100
12 TSLA 2020 4 NA 100
13 TSLA 2021 1 210 210
14 TSLA 2021 2 220 220
15 TSLA 2021 3 NA 100
16 TSLA 2021 4 NA 100
Or you can use tidyr::replace_na() function.
# A tibble: 16 × 4
company year quarter revenue
<chr> <dbl> <dbl> <dbl>
1 AAPL 2020 1 100
2 AAPL 2020 2 100
3 AAPL 2020 3 110
4 AAPL 2020 4 100
5 AAPL 2021 1 200
6 AAPL 2021 2 100
7 AAPL 2021 3 100
8 AAPL 2021 4 100
9 TSLA 2020 1 100
10 TSLA 2020 2 100
11 TSLA 2020 3 100
12 TSLA 2020 4 100
13 TSLA 2021 1 210
14 TSLA 2021 2 220
15 TSLA 2021 3 100
16 TSLA 2021 4 100
tidyr::fill() offers convenient filling options. It works like select() function.
- "down": fill downwards (LOCF)
- "up": fill upwards (NOCB)
- "downup": LOCF then NOCB
- "updown": NOCB then LOCF
When filling in a direction, grouping and arranging are important.
If you use LOCF, below is the correct approach:
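A sketch of why grouping matters, on a small hypothetical revenue table with rows deliberately out of order:

```r
library(dplyr)
library(tidyr)

rev_tbl <- tibble(
  company = c("TSLA", "AAPL", "AAPL", "TSLA", "AAPL"),
  quarter = c(2, 3, 1, 1, 2),
  revenue = c(220, 110, 100, NA, NA)
)

# Group, sort within group, then carry the last observation forward
filled <- rev_tbl |>
  group_by(company) |>
  arrange(quarter, .by_group = TRUE) |>
  fill(revenue, .direction = "down") |>
  ungroup()

# Without grouping, AAPL's last value leaks into TSLA's first quarter
leaky <- rev_tbl |>
  arrange(company, quarter) |>
  fill(revenue, .direction = "down")

filled$revenue
leaky$revenue
```

In the grouped version TSLA's first quarter stays NA, because there is no earlier TSLA observation to carry forward.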
Interpolation estimates a value between two known points. The approx() function is a handy tool.
By default, it returns 50 evenly spaced points across the range of x, with estimated values.
[1] 5
$x
[1] 1.000000 1.081633 1.163265 1.244898 1.326531 1.408163 1.489796 1.571429
[9] 1.653061 1.734694 1.816327 1.897959 1.979592 2.061224 2.142857 2.224490
[17] 2.306122 2.387755 2.469388 2.551020 2.632653 2.714286 2.795918 2.877551
[25] 2.959184 3.040816 3.122449 3.204082 3.285714 3.367347 3.448980 3.530612
[33] 3.612245 3.693878 3.775510 3.857143 3.938776 4.020408 4.102041 4.183673
[41] 4.265306 4.346939 4.428571 4.510204 4.591837 4.673469 4.755102 4.836735
[49] 4.918367 5.000000
$y
[1] 0.0100000000 0.0091836735 0.0083673469 0.0075510204 0.0067346939
[6] 0.0059183673 0.0051020408 0.0042857143 0.0034693878 0.0026530612
[11] 0.0018367347 0.0010204082 0.0002040816 -0.0006122449 -0.0014285714
[16] -0.0022448980 -0.0030612245 -0.0038775510 -0.0046938776 -0.0055102041
[21] -0.0063265306 -0.0071428571 -0.0079591837 -0.0087755102 -0.0095918367
[26] -0.0087755102 -0.0063265306 -0.0038775510 -0.0014285714 0.0010204082
[31] 0.0034693878 0.0059183673 0.0083673469 0.0108163265 0.0132653061
[36] 0.0157142857 0.0181632653 0.0206122449 0.0230612245 0.0255102041
[41] 0.0279591837 0.0304081633 0.0328571429 0.0353061224 0.0377551020
[46] 0.0402040816 0.0426530612 0.0451020408 0.0475510204 0.0500000000
You can get only certain observations with xout argument. Notice it generates a list output with x and y.
To pull the interpolated results, access y from the result.
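A sketch of xout with a five-element return vector. The input values are an assumption, chosen to be consistent with the printed output above.

```r
returns <- c(0.01, NA, -0.01, NA, 0.05)   # assumed input with two gaps

# Estimate only at the missing positions; NA pairs are dropped before fitting
est <- approx(x = seq_along(returns), y = returns, xout = c(2, 4))

est$x   # positions that were requested
est$y   # linear estimates at those positions
```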
It is easy to visualize the results from linear approximation:
If you have specific x-axis values to use when calculating the slope:
For example, fill the straight line estimate for “9 month” yield.
The treasury daily yield data looks like below.
# A tibble: 4 × 4
date x6_mo x1_yr x2_yr
<date> <dbl> <dbl> <dbl>
1 2025-04-01 4.23 4.01 3.87
2 2025-04-02 4.24 4.04 3.91
3 2025-04-03 4.2 3.92 3.71
4 2025-04-04 4.14 3.86 3.68
To interpolate, you’ll need to pivot the data and make an explicit missing value:
# A tibble: 16 × 3
date name value
<date> <fct> <dbl>
1 2025-04-01 x6_mo 4.23
2 2025-04-01 x9_mo NA
3 2025-04-01 x1_yr 4.01
4 2025-04-01 x2_yr 3.87
5 2025-04-02 x6_mo 4.24
6 2025-04-02 x9_mo NA
7 2025-04-02 x1_yr 4.04
8 2025-04-02 x2_yr 3.91
9 2025-04-03 x6_mo 4.2
10 2025-04-03 x9_mo NA
11 2025-04-03 x1_yr 3.92
12 2025-04-03 x2_yr 3.71
13 2025-04-04 x6_mo 4.14
14 2025-04-04 x9_mo NA
15 2025-04-04 x1_yr 3.86
16 2025-04-04 x2_yr 3.68
Then, generate a numeric column to help interpolate the yield estimates.
# A tibble: 6 × 4
date name value days
<date> <fct> <dbl> <dbl>
1 2025-04-01 x6_mo 4.23 180
2 2025-04-01 x9_mo NA 270
3 2025-04-01 x1_yr 4.01 360
4 2025-04-01 x2_yr 3.87 720
5 2025-04-02 x6_mo 4.24 180
6 2025-04-02 x9_mo NA 270
Finally, interpolate with approx() function. Notice the use of group_by() in this operation.
# A tibble: 16 × 5
# Groups: date [4]
date name value days value_interpolated
<date> <fct> <dbl> <dbl> <dbl>
1 2025-04-01 x6_mo 4.23 180 4.23
2 2025-04-01 x9_mo NA 270 4.12
3 2025-04-01 x1_yr 4.01 360 4.01
4 2025-04-01 x2_yr 3.87 720 3.87
5 2025-04-02 x6_mo 4.24 180 4.24
6 2025-04-02 x9_mo NA 270 4.14
7 2025-04-02 x1_yr 4.04 360 4.04
8 2025-04-02 x2_yr 3.91 720 3.91
9 2025-04-03 x6_mo 4.2 180 4.2
10 2025-04-03 x9_mo NA 270 4.06
11 2025-04-03 x1_yr 3.92 360 3.92
12 2025-04-03 x2_yr 3.71 720 3.71
13 2025-04-04 x6_mo 4.14 180 4.14
14 2025-04-04 x9_mo NA 270 4
15 2025-04-04 x1_yr 3.86 360 3.86
16 2025-04-04 x2_yr 3.68 720 3.68
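The grouped interpolation can be sketched on a single date from the table above. This is a minimal sketch; the tibble name `yields` is an assumption.

```r
library(dplyr)

yields <- tibble(
  date  = as.Date("2025-04-01"),
  name  = c("x6_mo", "x9_mo", "x1_yr", "x2_yr"),
  value = c(4.23, NA, 4.01, 3.87),
  days  = c(180, 270, 360, 720)
)

# Within each date, fit on the known points and evaluate at every maturity
interpolated <- yields |>
  group_by(date) |>
  mutate(value_interpolated = approx(days, value, xout = days)$y) |>
  ungroup()

interpolated
```

Grouping by date keeps each day's yield curve separate, so one day's yields are never used to fill another day's gap.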
ifelse() to create a new column return_filled where missing returns are filled with 0, assuming no change in stock price on those days.
tidyr::replace_na() to achieve the same result, replacing NA values with 0 in the return column.
Use tidyr::complete() to add the missing quarters for 2020 and 2021. Assume that each year should have quarters 1 to 4 (Q1, Q2, Q3, Q4). The missing revenue values should appear as NA.
Fill the missing revenue values using the LOCF method. Ensure the data is properly arranged by year and quarter before applying tidyr::fill().
approx() function to interpolate the yield for “9 Mo” based on the days and yield columns. Provide the interpolated yield value as your answer.
Characters (Strings) store text information in finance such as
We’ll mostly use the stringr package (included in tidyverse).
You can create strings by wrapping values with single quotes (') or double quotes (").
There are multiple string printing functions in R: print(), cat(), and str_view()
print() gives you full structure of underlying string
cat() shows rendered string output
str_view() shows rendered output (robust)
Special characters (quotes, backslashes, backticks, etc.) have reserved uses; to include them literally, escape them with a backslash \.
There are some other special characters worth remembering:
\n newline
\t tab
\U Unicode escapes
Creating a string with multiple quotes and backslashes gets confusing quickly! For example:
Without raw strings,
Double backslashes \\ and
double double quotes '""' with quotes
will make you crazy.
This is called Leaning Toothpick Syndrome
To eliminate escaping, you can use raw string with r"()", r"{}", r"[]".
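A sketch comparing an escaped string with its raw-string equivalent (R 4.0 or later). The file path and the quoted sentence are made up.

```r
# Escaped version: every backslash is doubled
path_escaped <- "C:\\Users\\analyst\\report.csv"

# Raw string: what you type is what you get
path_raw <- r"(C:\Users\analyst\report.csv)"

identical(path_escaped, path_raw)

# Quotes need no escaping inside a raw string either
quote_raw <- r"(The CEO said "guidance is unchanged".)"
cat(path_raw, quote_raw, sep = "\n")
```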
Create strings that contain the following values:
str_c() concatenates multiple string vectors, element-wise.
For example, combine a financial report header with
str_glue() improves readability by allowing embedded expressions within {}.
Earnings: Microsoft Inc. reported strong results.
Also works with vectorized operations with recycling.
If you want to collapse a vector of strings into a single string, use str_flatten() or paste().
[1] "Therewillbeastrongmarketvolatility."
[1] "There will be a strong market volatility."
Base R: paste() and collapse.
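A sketch of the combining functions in this section. The inputs are assumptions, chosen to be consistent with the printed outputs above.

```r
library(stringr)

company <- "Microsoft Inc."

str_c("Earnings: ", company)                             # element-wise concatenation
str_glue("Earnings: {company} reported strong results.") # expression embedded in {}

words_vec <- c("There", "will", "be", "a", "strong", "market", "volatility.")

flat_tight  <- str_flatten(words_vec)                    # no separator
flat_spaced <- str_flatten(words_vec, collapse = " ")    # single string with spaces
flat_base   <- paste(words_vec, collapse = " ")          # base R equivalent

flat_spaced
```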
Flatten the above companies character vector into a scalar string.
Fix the code below so that the embedded expression companies is evaluated, then print:
Two relevant concepts related to the length:
length()
str_length() or nchar()
You can extract parts of a string using position arguments with str_sub()
[1] "Ap" "Sa" "Ze"
[1] "." "C" "s"
[1] "e Inc." "son LLC" " Investments"
[1] " Inc." "n LLC" "ments"
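A sketch of str_sub() calls that produce output of this shape. The company names are assumptions reconstructed from the printed fragments.

```r
library(stringr)

firms <- c("Apple Inc.", "Sampson LLC", "Zeta Investments")   # assumed names

str_sub(firms, 1, 2)     # first two characters
str_sub(firms, -1, -1)   # last character (negative positions count from the end)
str_sub(firms, 5)        # from the fifth character to the end
str_sub(firms, -5)       # last five characters
```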
str_pad() pads a string to fixed length by adding extra whitespace on the left, right or both.
You can pad other strings, for example, leading zeros:
Upper / lowercase transformations:
Extract the first two characters from companies.
Extract the last two characters.
Transform to lowercases and uppercases.
Regular expressions (regex) are a language for describing “patterns” within strings.
stringr, tidyverse, and base R
We’ll use regular expression functions from the stringr and tidyr packages, both core members of the tidyverse.
To explore regular expressions, we’ll use:
Three character vectors from the stringr package:
fruit: names of 80 fruits
words: 980 common English words
sentences: 720 short example sentences
These built-in datasets are great for testing regex.
str_view() highlights matches in a string vector using <>.
Literal characters match exactly:
Some characters, like ., +, and *, have special meanings in regex and are known as metacharacters.
.: A wildcard that matches any single character. For example:
You can match specific lengths of text using . repeated:
[1] │ <apple>
[7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry
This matches an “a” followed by any three characters and an “e”.
Quantifiers control how often a pattern appears:
?: 0 or 1 time (optional)
+: 1 or more times
*: 0 or more times
Use str_view() to highlight pattern “ca”
Use str_view() to highlight pattern “ca” and following exactly one character (hint: .)
Use brackets [] to define and match sets of characters. This is also called a character class. For example, [aeiou] matches any vowel.
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
[836] │ <sys>tem
[901] │ <typ>e
The caret ^ inside brackets negates the set.
Caution
The caret ^ outside of brackets has a different meaning: it anchors the match to the beginning of the string.
Use | to match one of several patterns:
[1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
[9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n
This finds fruits containing specified keywords or repeated vowels.
Use str_view() to highlight the pattern where “x” is surrounded by vowels
Use str_view() to match words containing any of “flux” or “pixie”
str_view() is good to experiment on pattern matching. Other key functions are:
str_detect(): logical check if pattern exists
str_subset(): subset elements that contain the pattern
str_count(): count the occurrences of the pattern
str_replace(): replace patterns
separate_...(): separate by patterns
In real data, you can use str_detect() to check for the presence of a pattern.
str_subset() and str_which()
Two other useful functions are:
str_subset(): returns the elements that contain the pattern
str_which(): returns the numeric index of elements that contain the pattern
Example:
[1] "These days a chicken leg is a rare dish."
[2] "Rice is often served in round bowls."
[3] "A large size in stockings is hard to sell."
[4] "A rod is used to catch pink salmon."
[5] "The source of the huge river is the clear spring."
[6] "The fish twisted and turned on the bent hook."
[1] 4 5 10 12 13 22
You can use these to extract or locate matches without altering the original data structure.
str_count()
Check for repeated sequences:
Count the number of matches per string:
Sometimes your results may look off. For example, the name “Aaban” has three “a”s, but only two are counted. That’s because regex is case sensitive by default.
You can fix this in three ways:
regex(..., ignore_case = TRUE):
str_replace() replaces the first match. str_replace_all() replaces all matches.
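A sketch of the case-sensitivity issue and of first-match versus all-match replacement. The three fixes shown (a character class, the ignore_case flag, and lowercasing first) are common options and are assumed to be the ones intended here.

```r
library(stringr)

str_count("Aaban", "a")                                # 2: the capital "A" is missed

str_count("Aaban", "[aA]")                             # 3: character class
str_count("Aaban", regex("a", ignore_case = TRUE))     # 3: case-insensitive flag
str_count(str_to_lower("Aaban"), "a")                  # 3: normalize the case first

str_replace("banana", "a", "-")       # first match only
str_replace_all("banana", "a", "-")   # every match
```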
In a tibble (data frame), you can separate text into variables by
delimiter,
position and
pattern (regex)
separate_longer_delim() separates values into long form.
separate_wider_delim() separates values into wide form. You must specify names, and actions if too few or too many.
separate_longer_position() splits by fixed width. Must specify width.
separate_wider_position() separates values into wide form.
You must specify widths with named integer vector, and actions if too few or too many.
When you want to separate by regex patterns. Below is a complex sample:
Use separate_wider_regex() to extract structured data:
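A sketch of delimiter-based and regex-based separation on a hypothetical identifier column (requires tidyr 1.3.0 or later). The column and piece names are made up for illustration.

```r
library(tidyr)
library(tibble)

trades <- tibble(id = c("AAPL-2023-Q1", "MSFT-2024-Q2"))

# By delimiter: one new column per piece
by_delim <- trades |>
  separate_wider_delim(id, delim = "-", names = c("ticker", "year", "quarter"))

# By regex: named patterns become columns, unnamed patterns are matched and dropped
by_regex <- trades |>
  separate_wider_regex(
    id,
    patterns = c(ticker = "[A-Z]+", "-", year = "\\d{4}", "-Q", quarter = "\\d")
  )

by_delim
by_regex
```

The regex version drops the "Q" prefix because it sits in an unnamed pattern, which is harder to do with a plain delimiter.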
str_detect() to indicate whether each string contains a digit (hint [0-9]).
str_count() to count the number of vowels.
str_replace_all() to replace “a” to “e”
separate_wider to separate fruit to two variables by whitespace.
fruit first
To literally match metacharacters (., ?, *) in regex, use \.
To match ., you need regex \., and to express it in an R string you escape \ with \, giving "\\.".
To match ?, you need regex \?, and to express it \\?.
To match \, you need regex \\, and to express it \\\\
If you use raw strings in regex, it reduces one level of escaping.
Or you can escape with character set [] for some (not all) metacharacters.
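A sketch of the escaping options on two illustrative strings:

```r
library(stringr)

x <- c("a.c", "abc")

str_detect(x, "a.c")          # TRUE TRUE: . is a wildcard
str_detect(x, "a\\.c")        # TRUE FALSE: escaped dot, written as an R string
str_detect(x, r"(a\.c)")      # TRUE FALSE: raw string saves one level of escaping
str_detect(x, "a[.]c")        # TRUE FALSE: dot inside a character set is literal
str_detect(x, fixed("a.c"))   # TRUE FALSE: no regex interpretation at all

# Matching a literal backslash takes four backslashes in a regular R string
str_detect("a\\b", "\\\\")
```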
\ cannot be used with character set
If you want to match at the start or end, you need anchors
^ match the start
$ match the end
\b match boundary between words
Word boundary example:
Match end:
Anchors also match zero-width if used alone:
You can use this feature for replacements:
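A sketch of the anchors on illustrative strings, including the zero-width replacement trick:

```r
library(stringr)

str_detect(c("apple", "pineapple"), "^apple")    # start of string: TRUE FALSE
str_detect(c("apple", "pineapple"), "apple$")    # end of string: TRUE TRUE
str_detect(c("sum", "summary"), "\\bsum\\b")     # whole word only: TRUE FALSE

# Anchors match positions, not characters, so replacing them inserts text
str_replace_all("abc", c("$", "^", "\\b"), "--")
```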
Use [] to match any character from a set.
[abc]
[^abc] to exclude
[a-z] - defines range
\ escapes special characters within []
[1] │ <abc>d ABCD 12345 -!@#%.
[1] │ <abcd> ABCD 12345 -!@#%.
[1] │ abcd< ABCD >12345< -!@#%.>
[1] │ <a>-<b>-<c>
[1] │ <a><->b<-><c>
Some character sets are so common that they have shortcuts:
\d any digit
\D anything not digit
\s any whitespace (space, tab, newline)
\S anything not whitespace
\w any word character (letters and numbers)
\W any non-word character
[1] │ abcd ABCD <12345> -!@#%.
[1] │ <abcd ABCD >12345< -!@#%.>
[1] │ abcd< >ABCD< >12345< >-!@#%.
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
[1] │ <abcd> <ABCD> <12345> -!@#%.
[1] │ abcd< >ABCD< >12345< -!@#%.>
On top of
? (0 or 1)
+ (1 or more)
* (0 or more)
You can specify precise quantifiers
{n} exactly n times
{n,} at least n times
{n,m} between n and m times
Regular expressions follow precedence rules like math:
+, ?): high
|): low
You can use () to specify precedence and grouping.
Parentheses () can also be used for capturing groups.
Use \1, \2, etc., to refer back to matched groups.
Use back references to replace:
# A tibble: 3 × 2
value new
<chr> <chr>
1 The birch canoe slid on the smooth planks. The canoe birch slid on the smoot…
2 Glue the sheet to the dark blue background. Glue sheet the to the dark blue b…
3 It's easy to tell the depth of a well. It's to easy tell the depth of a …
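A sketch of back references for matching and for replacement. The sentence is the first entry of stringr::sentences, written out so the example does not depend on the dataset.

```r
library(stringr)

s <- "The birch canoe slid on the smooth planks."

# Capture the first three words, then swap the second and third
swapped <- str_replace(s, "(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2")
swapped

# \1 inside the pattern itself: a pair of characters that repeats immediately
str_detect(c("banana", "apple"), "(..)\\1")
```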
You can extract matches with str_match(), which returns a matrix:
[,1] [,2] [,3]
[1,] "the smooth planks" "smooth" "planks"
[2,] "the sheet to" "sheet" "to"
[3,] "the depth of" "depth" "of"
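A sketch of a str_match() call that yields a matrix like the one above. The pattern is an assumption consistent with the printed output.

```r
library(stringr)

s <- "The birch canoe slid on the smooth planks."

# Column 1 is the full match; each further column is one capturing group
m <- str_match(s, "the (\\w+) (\\w+)")
m

dim(m)   # one row per input string, one column per match component
```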
Convert to tibble:
Or use separate_wider_regex()
When you want to use () purely for grouping, not for capturing:
(?:) is non-capturing group
How would you match the literal string "'\ ? How about "$^$"
Given the corpus of common words in stringr::words, create regular expressions that find all words that:
Switch the first and last letters in words. Which of those strings are still words?
Describe in words what these regular expressions match. Read carefully to see if each entry is a regular expression or a string that defines a regular expression.
^.*$
"\\{.+\\}"
\d{4}-\d{2}-\d{2}
"\\\\{4}"
\..\..\..
(.)\1\1
"(..)\\1"
regex() gives more control over the pattern object, using flags.
Dotall flag allows . to match all, including \n.
Multiline makes ^ and $ match start and end of each line.
Comments allow you to write comments inside complex patterns.
phone <- regex(
r"(
\(? # optional opening parens
(\d{3}) # area code capturing group
[)\-]? # optional closing parens or dash
\ ? # optional space
(\d{3}) # another three numbers group
[\ -]? # optional space or dash
(\d{4}) # four digits group
)",
comments = TRUE
)
str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
[1] "514-791-8141" "(123) 456 7890" NA
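The dotall and multiline flags described earlier can be sketched on an illustrative three-line string:

```r
library(stringr)

x <- "Line 1\nLine 2\nLine 3"

# By default . does not match a newline; dotall = TRUE lets it
str_count(x, ".Line")
str_count(x, regex(".Line", dotall = TRUE))

# By default ^ matches only the start of the whole string; multiline = TRUE matches each line
str_count(x, "^Line")
str_count(x, regex("^Line", multiline = TRUE))
```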
Opt out of regular expression rules using fixed()
[1] │ <The> birch canoe slid on the smooth planks.
[4] │ <The>se days a chicken leg is a rare dish.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
[1] │ <The> birch canoe slid on the smooth planks.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
[13] │ <The> source of the huge river is the clear spring.
[3] │ <It>'s easy to tell the depth of a well.
[15] │ <He>lp the woman get back to her feet.
[27] │ <He>r purse was full of useless trash.
[29] │ <It> snowed, rained, and hailed the same morning.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
[3] │ <It>'s easy to tell the depth of a well.
[29] │ <It> snowed, rained, and hailed the same morning.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
[116] │ <He> ordered peach pie with ice cream.
[127] │ <It> caught its hind paw in a rusty trap.
How to spot such mistakes? Create a few positive and negative examples and test.
What if you wanted to find all sentences that mention a color?
[2] │ Glue the sheet to the dark <blue> background.
[26] │ Two <blue> fish swam in the tank.
[92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
[174] │ The sky that morning was clear and bright <blue>.
What if there are many colors, stored in data, like:
First you want to remove numbers from colors:
Now you can generate patterns using R code:
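A sketch of building the pattern programmatically from base R's colors(). The two test sentences are made up; the word boundaries are what keep "red" from matching inside "hundred".

```r
library(stringr)

cols <- colors()                          # built-in color names from grDevices
cols <- cols[!str_detect(cols, "\\d")]    # drop numbered variants such as "gray12"

# One alternation, wrapped in word boundaries
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")

str_detect(
  c("The sofa cushion is red.", "He ran a hundred miles."),
  pattern
)
```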
To fetch news data, you’ll need an API key from
The data.frame is stored in second level.
# A tibble: 6 × 9
author title description url url_to_image published_at content id
<chr> <chr> <chr> <chr> <chr> <dttm> <chr> <chr>
1 Axal S… Ital… "Italy Ban… http… https://med… 2026-02-07 20:20:35 "<d… <NA>
2 Waqas Bith… "A system … http… https://hac… 2026-02-07 20:08:59 "On 6 … <NA>
3 Oluwap… Trum… "The crypt… http… https://cry… 2026-02-07 20:05:46 "The c… <NA>
4 Editor Inte… "Podcast: … http… https://www… 2026-02-07 20:00:00 "Podca… <NA>
5 Diana … How … "By 2050, … http… https://www… 2026-02-07 20:00:00 "Young… <NA>
6 Editor… Mike… "Stablecoi… http… https://sta… 2026-02-07 19:35:52 "Stabl… <NA>
# ℹ 1 more variable: name <chr>
To filter financial news that mention “uncertain”:
Filter news that mention “uncertain” or “risk” or “option” or “down”,
# A tibble: 12 × 3
author title description
<chr> <chr> <chr>
1 Editor Interview 1999 – Gold Rush as Dollar Cra… "Podcast: …
2 Kurt Zindulka Half of British Voters Want Prime Minist… "Half of B…
3 Glenn Carle FO Exclusive: Global Lightning Roundup o… "Editor-in…
4 Bloomberg News Charting the Global Economy: ECB Holds, … "The Europ…
5 Juliana Kim DVDs and public transit: Boycott drives … "A sweepin…
6 Rafael Nam Trump promised a crypto revolution. So w… "Trump got…
7 The White Coat Investor 13 Reasons I Still Own Bonds "For some …
8 Reuters Iran's surging crypto activity draws US … "Crypto us…
9 Jake Simmons Kevin Warsh Will Trigger Bitcoin Regime … "Bitcoin’s…
10 James Halver Mining Stocks And Asian Markets Hit As B… "Bitcoin’s…
11 Everygame Casino Super Bowl Betting Promos: Everygame's L… "Everygame…
12 Bovada Super Bowl Betting Sites: Bovada's Welco… "An inform…
You can make sentiment polarity with simple lexicon matching:
positive_words <- c("gain", "rally", "beat", "surge", "growth", "record", "optimism", "strong")
negative_words <- c("loss", "fall", "miss", "drop", "decline", "weak", "concern", "crisis")
news_frame |>
mutate(
text = str_to_lower(title),
pos = str_count(text, str_c("\\b(", str_c(positive_words, collapse = "|"), ")\\b")),
neg = str_count(text, str_c("\\b(", str_c(negative_words, collapse = "|"), ")\\b")),
sentiment_score = pos - neg
)
# A tibble: 96 × 13
author title description url url_to_image published_at content id
<chr> <chr> <chr> <chr> <chr> <dttm> <chr> <chr>
1 Axal … Ital… "Italy Ban… http… https://med… 2026-02-07 20:20:35 "<d… <NA>
2 Waqas Bith… "A system … http… https://hac… 2026-02-07 20:08:59 "On 6 … <NA>
3 Oluwa… Trum… "The crypt… http… https://cry… 2026-02-07 20:05:46 "The c… <NA>
4 Editor Inte… "Podcast: … http… https://www… 2026-02-07 20:00:00 "Podca… <NA>
5 Diana… How … "By 2050, … http… https://www… 2026-02-07 20:00:00 "Young… <NA>
6 Edito… Mike… "Stablecoi… http… https://sta… 2026-02-07 19:35:52 "Stabl… <NA>
7 Garet… As t… "As the We… http… https://liv… 2026-02-07 19:31:07 "Is Au… abc-…
8 Quent… Afte… "The specu… http… https://s.y… 2026-02-07 19:30:00 "Galax… <NA>
9 Bloom… Tech… "The bigge… http… https://sma… 2026-02-07 19:29:26 "(Bloo… fina…
10 Joe W… Prof… "\"Sharp b… http… https://fut… 2026-02-07 19:15:00 "Follo… <NA>
# ℹ 86 more rows
# ℹ 5 more variables: name <chr>, text <chr>, pos <int>, neg <int>,
# sentiment_score <int>
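To see how the word boundaries `\\b` affect the counts, here is a small base R sketch of the same scoring idea on made-up titles. The helper `count_matches()` and the shortened word lists are hypothetical, introduced only for this illustration.

```r
# Shortened lexicons, for illustration only
positive_words <- c("gain", "rally")
negative_words <- c("loss", "drop")

# Hypothetical helper: count whole-word lexicon hits in each text
count_matches <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, tolower(text), perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

titles <- c(
  "Stocks rally as banks gain",
  "Shares drop after loss warning, then rally",
  "Gains fade"
)

pos <- count_matches(titles, positive_words)
neg <- count_matches(titles, negative_words)
sentiment_score <- pos - neg
print(sentiment_score)
```

The third title scores 0 because `\\bgain\\b` does not match "gains". A lexicon of exact words misses inflected forms unless they are listed or the pattern is relaxed.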
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
colors() contains a number of modifiers like “light”, “dark”, “medium” as in “lightgray” and “darkblue”. How could you automatically identify these modifiers?
The use of LLMs in financial data analysis can be very effective.
Cloud LLMs
ellmer supports cloud LLM backends
An API key is required (e.g. OPENAI_API_KEY)
Local LLMs
ollamar and mall package
ellmer also supports local LLMs
ellmer
ellmer connects to cloud/local LLMs.
Google Gemini provides a free-tier API.
Get the API key
Set the chat machine
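A minimal sketch of these two steps, assuming the ellmer package is installed and that `chat_google_gemini()` reads the key from the `GOOGLE_API_KEY` environment variable (check the documentation of your installed version). The system prompt and the question are placeholders. The guard makes the script a no-op when the package or key is missing.

```r
# Check whether an API key is available in the environment.
# One common way to set it is a line such as GOOGLE_API_KEY=... in ~/.Renviron.
has_key <- nzchar(Sys.getenv("GOOGLE_API_KEY"))

if (has_key && requireNamespace("ellmer", quietly = TRUE)) {
  # Create the chat object (placeholder system prompt)
  chat <- ellmer::chat_google_gemini(
    system_prompt = "You are a helpful assistant for a finance class."
  )
  # Send one message and print the reply
  chat$chat("In one sentence, what is a stablecoin?")
} else {
  message("Skipping: ellmer or GOOGLE_API_KEY is not available.")
}
```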
You can use it in interactive mode with live_console() or live_browser().
Try it out yourself:
Core Principles for effective use of LLMs:
Give the Role & Context
Source Delimiters
Explicit Output Format
Determinism Controls
Define the LLM’s role and the financial context.
Examples
Wrap input text (news, filings) in triple backticks to clarify input boundaries.
Examples
news_frame$content[[1]]”
I'm sorry, I cannot access the content of `news_frame$content[1]` directly. As
an AI model, I don't have the capability to execute code or access external
dataframes from your local environment.
Could you please provide the actual text you'd like me to analyze? Once you
paste the text, I'll be happy to give you a sentiment analysis!
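The reply above suggests the model received the literal string `news_frame$content[[1]]` rather than the article text. A minimal sketch of inserting the actual value into the prompt, using base R and a made-up article in place of `news_frame$content[[1]]`:

```r
# Made-up article text; in the lecture data this would be news_frame$content[[1]]
article_text <- "Shares of Example Corp fell 5% after earnings missed estimates."

# Build the triple-backtick delimiter programmatically
fence <- strrep("`", 3)

# Paste the *value* of the variable between the delimiters
prompt <- paste0(
  "Tell me the sentiment of this article: ",
  fence, article_text, fence
)

cat(prompt, "\n")
```

With stringr, `str_glue()` and a `{article_text}` placeholder achieve the same interpolation.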
Request structured outputs for your analysis. When multiple answers are expected, JSON format is recommended.
Examples
Set temperature = 0 to make responses as consistent as possible, and limit the token budget to control costs.
Note
Temperature is an LLM parameter that controls randomness (0-2 range). A low value (0) gives consistent, predictable, and rigid outputs. A high value (e.g. 1) gives creative and varied responses.
Tokens are the units of text an LLM reads and writes, roughly comparable to words; token counts measure the size of the input and output text.
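As a simplified illustration of why temperature controls randomness (not the exact procedure of any particular provider), sampling probabilities are commonly described as a softmax over token scores divided by the temperature. The scores below are made up, and `softmax_t()` is a hypothetical helper.

```r
# Hypothetical helper: softmax over scores scaled by temperature
softmax_t <- function(logits, temperature) {
  scaled <- logits / temperature
  w <- exp(scaled - max(scaled))  # subtract the max for numerical stability
  w / sum(w)
}

logits <- c(2, 1, 0)  # made-up scores for three candidate tokens

p_low  <- softmax_t(logits, temperature = 0.2)  # sharply peaked
p_high <- softmax_t(logits, temperature = 2)    # much flatter

print(round(p_low, 3))
print(round(p_high, 3))
```

As the temperature approaches 0, nearly all probability goes to the top-scoring token, which is why low settings give repeatable output.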
Let’s set up a financial news analyzer with prompt engineering, as in the example below.
news_analyzer <- chat_google_gemini(
  system_prompt = r"{
You are an expert financial analyst.
You will be provided a news article title to analyze, which will be wrapped in triple backticks ```.
Your task is to assess the market sentiment of a news article.
Return valid JSON with curly braces without any other formatting:
– "score": a real number in [0, 1] (0 = extremely negative, 1 = extremely positive).
– "rationale": fewer than 25 words.
Do not add any keys, text, or commentary outside the JSON object.
}",
  # api_key = "Your_API_KEY",
  api_args = list(
    generationConfig = list(
      temperature = 0,
      maxOutputTokens = 100
    )
  )
)
Prepare the news data frame from newsanchor.
As a test run:
Since the output always comes wrapped in a Markdown json code fence, we can strip it with a regex.
The jsonlite package parses JSON-formatted strings.
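A minimal sketch of the cleaning step in base R, using a made-up response string in place of a real model reply. The delimiter is built with `strrep()` only to keep literal triple backticks out of this example.

```r
fence <- strrep("`", 3)

# Made-up raw reply, shaped like a fenced JSON answer
raw_response <- paste0(
  fence, "json\n",
  "{\"score\": 0.2, \"rationale\": \"Earnings miss\"}\n",
  fence
)

# 1. Drop newlines so the regex can match across the whole reply
one_line <- gsub("\n", "", raw_response, fixed = TRUE)

# 2. Keep only the text between the opening json fence and the closing fence
pattern <- paste0("^", fence, "json(.*)", fence, "$")
clean <- sub(pattern, "\\1", one_line)

print(clean)

# 3. Parse the JSON string into an R list, if jsonlite is installed
if (requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- jsonlite::fromJSON(clean)
  print(parsed$score)
}
```

If the reply arrives without the fence, `sub()` finds no match and returns the string unchanged, so the parsing step still receives plain JSON.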
Now, we can analyze sentiment of financial news titles.
Step 1: Build a function that sends a title to the LLM and returns the parsed response
get_sentiment <- function(title) {
  # Wrap the title in triple backticks, as the system prompt expects
  prompt <- str_glue("Tell me the sentiment of this article: ```{title}```")
  llm_response <- news_analyzer$chat(prompt, echo = "none")
  clean_response <-
    # Remove newlines so that (.*) can span the whole reply
    str_replace_all(llm_response, "\\n", "") |>
    str_replace(
      "^```json(.*)```$", # capturing group
      "\\1"
    ) |>
    fromJSON()
  return(clean_response)
}
Step 2: Map the function
Step 3: Tidy the data (unnest_wider())
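Steps 2 and 3 can be sketched without calling an LLM by swapping in a stub. Here `fake_sentiment()` is a hypothetical stand-in for `get_sentiment()` that returns a list shaped like the parsed JSON, and `toy_news` is made-up data. The sketch assumes dplyr, purrr, and tidyr are installed.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical stub: same output shape as the parsed JSON, but no LLM call
fake_sentiment <- function(title) {
  list(
    score = if (grepl("surge|rally", title, ignore.case = TRUE)) 0.8 else 0.3,
    rationale = "placeholder rationale"
  )
}

toy_news <- tibble(
  title = c("Stocks rally on earnings", "Bank shares slip after warning")
)

scored <- toy_news |>
  # Step 2: apply the function to each title, producing a list-column
  mutate(sentiment = map(title, fake_sentiment)) |>
  # Step 3: spread each list element into its own column
  unnest_wider(sentiment)

print(scored)
```

Replacing `fake_sentiment` with `get_sentiment` gives the real pipeline, at the cost of one API call per title.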
Collect and analyze 10 financial news articles for market sentiment.
newsanchor with above topic.
FIN4770: Programming for FinTech
Comments
Comments explain the code and make it more readable; they are ignored by the computer.
Use # to make a comment
Everything after # on the line is ignored
Tip
Use the Ctrl (Cmd) + / hotkey to toggle comments.
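A short illustration of both comment styles; the variable names and numbers are made up.

```r
# A full-line comment: R skips this entire line.
rate <- 0.05                      # everything after # on this line is ignored
# rate <- 0.10                    # commented-out code does not run
future_value <- 100 * (1 + rate)  # uses rate = 0.05
print(future_value)
```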